Abstract

Higher Criticism is a method for detecting signals that are both sparse and weak. Although
first proposed in cases where the noise variables are independent, Higher Criticism also has
reasonable performance in settings where those variables are correlated. In this paper we
show that, by exploiting the nature of the correlation, performance can be improved by using
a modified approach which exploits the potential advantages that correlation has to offer.
Indeed, it turns out that the case of independent noise is the most difficult of all, from a
statistical viewpoint, and that more accurate signal detection (for a given level of signal
sparsity and strength) can be obtained when correlation is present. We characterize the
advantages of correlation by showing how to incorporate them into the definition of an
optimal detection boundary. The boundary has particularly attractive properties when
correlation decays at a polynomial rate or the correlation matrix is Toeplitz.

1 Introduction


Donoho and Jin [14] developed Tukey's [36] proposal for "Higher Criticism" (HC), showing
that a method based on the statistical significance of a large number of statistically
significant test results could be used very effectively to detect the presence of very sparsely
distributed signals. They demonstrated that HC is capable of optimally detecting the
presence of signals that are so weak, and so sparse, that the signal cannot be consistently
estimated. Applications include the problem of signal detection against cosmic microwave
background radiation (Cayon et al. [8], Cruz et al. [12], Jin [271 EHl EH], Jin et al. [H]).
Related work includes that of Cai et al. [7], Hall et al. [20] and Meinshausen and Rice [32].

The context of Donoho and Jin's [14] work was that where the noise is white, although
a small number of investigations have been made of the case of correlated noise (Hall et
al. [20], Hall and Jin [21], Delaigle and Hall [13]). However, that research has focused on
the ability of standard HC, applied in the form that is appropriate for independent data, to 
accommodate the non-independent case. In this paper we address the problem of how to 
modify HC by developing innovated Higher Criticism (iHC) and showing how to optimize 
performance for correlated noise. 

Curiously, it turns out that when using the iHC method tuned to give optimal performance,
the case of independence is the most difficult of all, statistically speaking. To
appreciate why this result is reasonable, note that if the noise is correlated then it does not 
vary so much from one location to a nearby location, and so is a little easier to identify. In 
an extreme case, if the noise is perfectly correlated at different locations then it is constant, 
and in this instance it can be easily removed. 

On the other hand, standard HC does not perform well in the case of correlated noise, 
because it utilizes only the marginal information in the data, without much attention to 
the correlation structure. Innovated HC is designed to exploit the advantages offered by 
correlation, and gives good performance across a wide range of settings. 

The concept of the "detection boundary" was introduced by Donoho and Jin [14] in
the context of white noise. In this paper, we extend it to the correlated case. In brief, the 
detection boundary describes the relationship between signal sparsity and signal strength 
that characterizes the boundary between cases where the signal can be detected, and cases 
where it cannot. In the setting of dependent data, this watershed depends on the corre- 
lation structure of the noise as well as on the sparsity and strength of the signal. When 
correlation decays at a polynomial rate we are able to characterize the detection boundary 
quite precisely. In particular, we show how to construct concise lower and upper bounds to
the detection boundary, based on the diagonal components of the inverse of the correlation
matrix, $\Sigma_n^{-1}$. A special case is where $\Sigma_n$ is Toeplitz; there the upper and the
lower bounds to the detection boundary are asymptotically the same. In the Toeplitz case,
iHC is optimal for signal detection but standard HC is not.

The paper is organized as follows. Section 2 introduces the sparse signal model followed
by a brief review of the uncorrelated case. Section 3 establishes lower bounds to the
detection boundary in correlated settings. Section 4 introduces innovated HC and establishes
an upper bound to the detection boundary. Section 5 applies the main results in Sections
3 and 4 to the case where the $\Sigma_n$ are Toeplitz. In this case, the lower bound coincides
with the upper bound and innovated HC is optimal for detection. Section 6 discusses a
case where the signals have a more complicated structure. Section 7 investigates a case
of strong dependence. Simulations are given in Section 8 and discussion is given in
Section 9. Sections 10, 11 and 12 give proofs of theorems, lemmas, and secondary lemmas,
respectively.



2 Sparse signal model, review of HC in uncorrelated case 

Consider an n-dimensional Gaussian vector 

$$X = \mu + Z, \quad \text{where } Z \sim N(0, \Sigma), \tag{2.1}$$

with the mean vector $\mu$ unknown and the dimension $n$ large. In most parts of the paper,
we assume that $\Sigma = \Sigma_n$ is known and has unit diagonal elements (the case where
$\Sigma_n$ is unknown is discussed in Section 9). We are interested in testing whether no signal
exists (i.e. $\mu = 0$) or there is a sparse and faint signal.

This setting arises in many applications. One example is global testing in linear
models. Consider a linear model $Y \sim N(M\mu, I_n)$, where the matrix $M$ has many rows and
columns, and we are interested in testing whether $\mu = 0$. The setting is closely related to
Model (2.1), since the least squares estimator of $\mu$ is distributed as $N(\mu, (M'M)^{-1})$. The
global testing problem is important in many applications. One is that of testing whether a
clinical outcome is associated with the expression pattern of a pre-specified group of genes
(Goeman et al. [17, 18]), where $M$ is the expression profile of the specified group of genes.
Another is expression quantitative trait loci (eQTL) analysis, where $M$ is related to the
numbers of common alleles for different genetic markers and individuals (Chen et al. [9]).
In both examples, $M$ is either observable or can be estimated. Also, it is frequently seen
that only a small proportion of genes is associated with the clinical outcome, and each gene
contributes weakly to the clinical outcome. In such a situation, the signals are both sparse
and faint.



Back to Model (2.1). We model the number of nonzero entries of $\mu$ as

$$m = n^{1-\beta}, \quad \text{where } \beta \in (1/2, 1). \tag{2.2}$$

This is a very sparse case, for the proportion of signals is much smaller than $1/\sqrt{n}$. We
suppose that the signals appear at $m$ different locations, $\ell_1 < \ell_2 < \cdots < \ell_m$, that are
randomly drawn from $\{1, 2, \ldots, n\}$ without replacement,

$$P\{\ell_1 = n_1, \ell_2 = n_2, \ldots, \ell_m = n_m\} = \binom{n}{m}^{-1}, \quad \text{for all } 1 \le n_1 < n_2 < \cdots < n_m \le n, \tag{2.3}$$

and that they have a common magnitude of

$$A_n = \sqrt{2 r \log n}, \quad \text{where } r \in (0, 1).$$

We are interested in testing which of the following two hypotheses is true:

$$H_0: \mu = 0 \quad \text{vs.} \quad H_1^{(n)}: \mu \text{ is a sparse vector as above.} \tag{2.4}$$

This testing problem was found to be delicate even in the uncorrelated case where
$\Sigma_n = I_n$. See [14] (also [7, 23, 27, 32]) for details. Below, we briefly review the results
in the uncorrelated case.
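The sampling scheme in (2.2) and (2.3) can be sketched in a few lines of Python. This is only an illustration for the uncorrelated case; the function name `sparse_mean`, the rounding of $m$ to an integer, and the parameter values are assumptions made for the example rather than part of the model.

```python
import math
import random

def sparse_mean(n, beta, r, rng):
    """Draw a mean vector with m = n^(1 - beta) signals of common size A_n."""
    m = max(1, round(n ** (1.0 - beta)))          # number of signals, as in (2.2)
    amp = math.sqrt(2.0 * r * math.log(n))        # common magnitude A_n
    locations = sorted(rng.sample(range(n), m))   # uniform draw without replacement, as in (2.3)
    mu = [0.0] * n
    for loc in locations:
        mu[loc] = amp
    return mu, locations, amp

rng = random.Random(0)
n, beta, r = 10_000, 0.7, 0.4                     # illustrative values only
mu, locations, amp = sparse_mean(n, beta, r, rng)

# Uncorrelated case Sigma_n = I_n: add independent N(0, 1) noise to the mean.
x = [mu_j + rng.gauss(0.0, 1.0) for mu_j in mu]
print(len(locations), round(amp, 3))
```

Correlated noise would require replacing the last step by a draw from $N(0, \Sigma_n)$, for example by multiplying independent normals by a Cholesky factor of $\Sigma_n$.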

2.1 Detection boundary ($\Sigma_n = I_n$)

The testing problem is characterized by the curve $r = \rho^*(\beta)$ in the $\beta$-$r$ plane, where

$$\rho^*(\beta) = \begin{cases} \beta - 1/2, & 1/2 < \beta \le 3/4, \\ (1 - \sqrt{1 - \beta})^2, & 3/4 < \beta < 1, \end{cases} \tag{2.5}$$

and we call $r = \rho^*(\beta)$ the detection boundary. The detection boundary partitions the
$\beta$-$r$ plane into two sub-regions: the undetectable region below the boundary and the
detectable region above the boundary; see Figure 1. In the interior of the undetectable
region, the signals are so sparse and so faint that no test is able to successfully separate the
alternative hypothesis from the null hypothesis in (2.4): the sum of Type I and Type II errors
of any test tends to 1 as $n$ diverges to $\infty$. In the interior of the detectable region, it is
possible to have a test such that as $n$ diverges to $\infty$, the Type I error tends to 0 and the
power tends to 1. (In fact, Neyman-Pearson's Likelihood Ratio Test (LRT) is such a test.) See
[H EH ET] for example. 

The drawback of LRT is that it needs detailed information of the unknown parameters
$(\beta, r)$. In practice, we need a test that does not need such information; this is where HC
comes in.

Figure 1: Phase diagram for the detection problem in the uncorrelated case. The detection
boundary separates the $\beta$-$r$ plane into the detectable region and the undetectable region.
In the estimable region, it is not only possible to reliably tell the existence of nonzero
coordinates, but is also possible to identify them individually.
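As a small worked illustration, the boundary can be evaluated numerically. The sketch below assumes the standard two-piece form of $\rho^*$ from [14] ($\beta - 1/2$ up to $\beta = 3/4$, and $(1 - \sqrt{1 - \beta})^2$ beyond); the function names are hypothetical.

```python
import math

def rho_star(beta):
    """Detection boundary in the uncorrelated case, for 1/2 < beta < 1."""
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2

def is_detectable(beta, r):
    """True when (beta, r) lies strictly above the detection boundary."""
    return r > rho_star(beta)

for beta in (0.55, 0.65, 0.75, 0.85, 0.95):
    print(beta, round(rho_star(beta), 4))
```

The two pieces agree at $\beta = 3/4$, where both give $1/4$, so the boundary is continuous.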



2.2 Higher Criticism and its optimal adaptivity ($\Sigma_n = I_n$)

A notion that goes back to Tukey [36], Higher Criticism was first proposed in [14] to tackle
the aforementioned testing problem in the uncorrelated case. To apply Higher Criticism,
let $p_j = P\{|N(0,1)| \ge |X_j|\}$ be the p-value associated with the $j$-th observation unit, and
let $p_{(j)}$ be the $j$-th p-value after sorting in ascending order. The Higher Criticism statistic
is defined as

$$HC_n^* = \max_{\{j:\ 1/n \le p_{(j)} \le 1/2\}} \left\{ \frac{\sqrt{n}\,\bigl(j/n - p_{(j)}\bigr)}{\sqrt{p_{(j)}\bigl(1 - p_{(j)}\bigr)}} \right\}. \tag{2.6}$$

There are also other versions of HC; see [121 dS] for example. When $H_0$ is true, $HC_n^*$
equals in distribution the maximum of the standardized uniform stochastic process.
Therefore, by a well-known result for empirical processes [33],

$$\frac{HC_n^*}{\sqrt{2 \log\log n}} \to 1 \quad \text{in probability.} \tag{2.7}$$

Consider the HC test which rejects the null hypothesis when

$$HC_n^* > (1 + a)\sqrt{2 \log\log n}, \quad \text{where } a > 0 \text{ is a constant.} \tag{2.8}$$

It follows from (2.7) that the Type I error tends to 0 as $n$ diverges to $\infty$. For any parameters
$(\beta, r)$ that fall in the interior of the detectable region, the Type II error also tends to 0.
This is the following theorem, where we set $a = 0.01$ for simplicity of presentation.

Theorem 2.1 Consider the HC test that rejects $H_0$ when $HC_n^* > 1.01\sqrt{2 \log\log n}$. For
every alternative $H_1^{(n)}$ where the associated parameters $(r, \beta)$ satisfy $r > \rho^*(\beta)$, the HC
test has asymptotically full power for detection:

$$P_{H_1^{(n)}}\{\text{Reject } H_0\} \to 1 \quad \text{as } n \to \infty.$$

That is, the HC test adapts to unknown parameters $(\beta, r)$, and yields asymptotically full
power for detection throughout the entire detectable region. We call this the optimal
adaptivity of HC [14].
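The HC statistic and the associated test are short to compute. The following is a minimal sketch using only the Python standard library; the function names `higher_criticism` and `hc_test` are hypothetical, and the two-sided p-value is evaluated through the identity $P\{|N(0,1)| \ge t\} = \mathrm{erfc}(t/\sqrt{2})$.

```python
import math
import random

def higher_criticism(x):
    """HC statistic: maximise over sorted p-values lying in [1/n, 1/2]."""
    n = len(x)
    pvals = sorted(math.erfc(abs(v) / math.sqrt(2.0)) for v in x)
    best = -math.inf                       # returned if no p-value falls in [1/n, 1/2]
    for j, p in enumerate(pvals, start=1):
        if 1.0 / n <= p <= 0.5:
            score = math.sqrt(n) * (j / n - p) / math.sqrt(p * (1.0 - p))
            best = max(best, score)
    return best

def hc_test(x, a=0.01):
    """Reject H0 when HC exceeds (1 + a) * sqrt(2 log log n)."""
    n = len(x)
    return higher_criticism(x) > (1.0 + a) * math.sqrt(2.0 * math.log(math.log(n)))

rng = random.Random(1)
null_sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
print(higher_criticism(null_sample), hc_test(null_sample))
```

The threshold is asymptotic, so for moderate $n$ the empirical Type I error of this sketch need not be small.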

Theorem 2.1 is closely related to [14, Theorem 1.2], where a mixture model is used.
The mixture model reduces approximately to the current model if we randomly shuffle the
coordinates of $X$. However, despite its appealing technical convenience, it is not clear how
to generalize the mixture model from the uncorrelated case to general correlated settings.
Theorem 2.1 is a special case of Theorem 4.2.

We now turn to the correlated case. In this case, the exact "detection boundary" may
depend on $\Sigma_n$ in a complicated manner, but it is possible to establish both a tight lower
bound and a tight upper bound. We discuss the lower bound first.



3 Lower bound to the detectability 

To establish the lower bound, a key element is the theory of comparison of experiments
(e.g. [34]), where a useful guideline is that adding noise always makes the inference more
difficult. Thus, we can alter the model by either adding or subtracting a certain amount of
noise, so that the difficulty level (measured by the Hellinger distance, or the $\chi^2$-distance,
etc., between the null density and the alternative density) of the original problem is
sandwiched by those of the two adjusted models. The correlation matrices in the latter have
a simpler form and hence are much easier to analyze. Another key element is the recent
development of matrix characterizations based on polynomial off-diagonal decay, which
shows that the inverse of a matrix with this property shares the same rate of decay as the
original matrix.



3.1 Comparison of experiments: adding noise makes inference harder 

We begin by comparing two experiments that have the same mean, but where the data from 
one experiment are more noisy than those from the other. Intuitively, it is more difficult 
to make inference in the first experiment than in the other. Specifically, consider the two 
Gaussian models 

$$X = \mu + Z, \quad Z \sim N(0, \Sigma) \qquad \text{and} \qquad X^* = \mu + Z^*, \quad Z^* \sim N(0, \Sigma^*), \tag{3.1}$$

where $\mu$ is an $n$-vector that is generated according to some distribution $G = G_n$. The
second model is more noisy than the first, in the sense that $\Sigma^* \ge \Sigma$; see the definition
below.

Definition 3.1 Consider two matrices $A$ and $B$. We write $A \ge B$ if $A - B$ is positive
semi-definite.

The second model in (3.1) can be viewed as the result of adding noise to the first. Indeed,
defining $\Delta = \Sigma^* - \Sigma$, taking $\xi$ to be $N(0, \Delta)$ (independently of $Z$), and noting that
$Z + \xi \sim N(0, \Sigma + \Delta)$, the second model is seen to be equivalent to $X + \xi = \mu + (Z + \xi)$.
Intuitively, adding noise generally makes inference more difficult. This can be stated precisely
by comparing distances between distributions, for example Hellinger distances. In detail, if
we denote the Hellinger distance between $X$ and $Z$ by $H_n(X, Z; \mu, \Sigma_n)$, and that between
$X^*$ and $Z^*$ by $H_n(X^*, Z^*; \mu, \Sigma_n^*)$, then we have the following theorem, which is proved in
Section 10.

Theorem 3.1 Suppose $\Sigma_n \le \Sigma_n^*$ in (3.1). Then $H_n(X, Z; \mu, \Sigma_n) \ge H_n(X^*, Z^*; \mu, \Sigma_n^*)$.

3.2 Inverses of matrices having polynomial off-diagonal decay

Next, we review results concerning matrices with polynomial off-diagonal decay. The main
message is that, under mild conditions, if a matrix has polynomial off-diagonal decay, then
its inverse as well as its Cholesky factorization (which is unique if we require the diagonal
entries to be positive) also have polynomial off-diagonal decay, and with the same rate.
This beautiful result was recently obtained by Jaffard [25]; also see [19, 35]. In detail, let
$\mathbb{Z}$ be the set of all integers. Write $\ell^2$ for the set of square-summable sequences
$x = \{x_k\}_{k \in \mathbb{Z}}$, and let $A = (A(j,k))_{j,k \in \mathbb{Z}}$ be an infinite matrix. Also, let $|x|_2$ be the
$\ell^2$-vector norm of $x$, and $\|A\|$ be the operator norm of $A$: $\|A\| = \sup_{x:\ |x|_2 = 1} |Ax|_2$.
Fixing positive constants $\lambda$, $M$, and $c_0$, we define the class of matrices

$$\Theta_\infty(\lambda, c_0, M) = \bigl\{ A = (A(j,k))_{j,k \in \mathbb{Z}} :\ |A(j,k)| \le M(1 + |j - k|)^{-\lambda},\ \|A\| \ge c_0 \bigr\}. \tag{3.2}$$

The following lemma follows directly from [35].

Lemma 3.1 Fix $\lambda > 1$, $c_0 > 0$, and $M > 0$. For any matrix $A \in \Theta_\infty(\lambda, c_0, M)$, there is a
constant $C > 0$, depending only on $\lambda$, $M$ and $c_0$, such that $|A^{-1}(j,k)| \le C \cdot (1 + |j - k|)^{-\lambda}$.

Now, consider a sequence of matrices of finite but increasingly larger sizes, where the
entries have a given rate of polynomial off-diagonal decay and where the operator norm
is uniformly bounded from below. Then the same rate of polynomial off-diagonal decay
holds for their inverses, as well as for the inverse of their Cholesky factorizations. In detail,
writing $\Theta_n$ for the set of $n \times n$ correlation matrices, we introduce the set of matrices

$$\Theta_n^*(\lambda, c_0, M) = \bigl\{ \Sigma_n \in \Theta_n :\ |\Sigma_n(j,k)| \le M(1 + |j - k|)^{-\lambda},\ \|\Sigma_n\| \ge c_0 \bigr\}. \tag{3.3}$$

The following lemma follows from Lemma 3.1 and is proved in Section 11.

Lemma 3.2 Fix $\lambda > 1$, $c_0 > 0$, and $M > 0$. For any sequence of matrices $\Sigma_n$, $n \ge 1$, such
that $\Sigma_n \in \Theta_n^*(\lambda, c_0, M)$, let $U_n$ be the inverse of the Cholesky factorization of $\Sigma_n$. There is
a constant $C = C(\lambda, c_0, M) > 0$ such that for any $n$ and any $1 \le j, k \le n$,

$$|\Sigma_n^{-1}(j,k)| \le C \cdot (1 + |j - k|)^{-\lambda}, \qquad |U_n(j,k)| \le C \cdot (1 + |j - k|)^{-\lambda}.$$

When $\lambda = 1$, the first inequality continues to hold, and the second holds if we adjoin a $\log n$
factor to the right hand side.
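The content of Lemma 3.2 can be checked numerically on a single example. The sketch below assumes NumPy and a hypothetical correlation matrix with off-diagonal entries $0.3\,(1 + |j - k|)^{-2}$, chosen so that the matrix is diagonally dominant and hence positive definite; it reports the smallest constant $C$ for which the decay bound holds for the matrix, its inverse, and the inverse of its Cholesky factor. It illustrates the statement and proves nothing.

```python
import numpy as np

def decay_constant(mat, lam):
    """Smallest C with |mat[j, k]| <= C * (1 + |j - k|)^(-lam) for all j, k."""
    idx = np.arange(mat.shape[0])
    dist = np.abs(idx[:, None] - idx[None, :])
    return float(np.max(np.abs(mat) * (1.0 + dist) ** lam))

n, lam = 200, 2.0                                  # illustrative size and decay rate
idx = np.arange(n)
dist = np.abs(idx[:, None] - idx[None, :])
sigma = 0.3 * (1.0 + dist) ** (-lam)               # polynomial off-diagonal decay
np.fill_diagonal(sigma, 1.0)                       # unit diagonal: a correlation matrix

sigma_inv = np.linalg.inv(sigma)
chol = np.linalg.cholesky(sigma)                   # lower triangular, sigma = chol @ chol.T
u = np.linalg.inv(chol)                            # inverse of the Cholesky factor

print(decay_constant(sigma, lam), decay_constant(sigma_inv, lam), decay_constant(u, lam))
```

All three constants stay of moderate size here, which is what the lemma predicts uniformly in $n$ for matrices in the class.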

3.3 Lower bound to the detectability 

We are now ready for the lower bound. Consider a sequence of matrices $\Sigma_n \in \Theta_n^*(\lambda, c_0, M)$.
Suppose the extreme diagonal entries of $\Sigma_n^{-1}$ have an upper limit $0 < \bar\gamma_0 < \infty$, i.e.

$$\limsup_{n \to \infty} \Bigl( \max_{\sqrt{n} \le k \le n - \sqrt{n}} \bigl\{ \Sigma_n^{-1}(k,k) \bigr\} \Bigr) = \bar\gamma_0. \tag{3.4}$$

Recall that the detection boundary in the uncorrelated case is $r = \rho^*(\beta)$. The following
theorem says that if we re-scale and write $r = \bar\gamma_0^{-1} \cdot \rho^*(\beta)$, then we obtain a lower bound.

Theorem 3.2 Fix $\beta \in (1/2, 1)$, $r \in (0, 1)$, $\lambda > 1$, $c_0 > 0$, and $M > 0$. Consider a sequence
of correlation matrices $\Sigma_n \in \Theta_n^*(\lambda, c_0, M)$ that satisfy (3.4). If $r < \bar\gamma_0^{-1} \rho^*(\beta)$, then the null
hypothesis and alternative hypothesis in (2.4) merge asymptotically, and the sum of Type I
and Type II errors of any test converges to 1 as $n$ diverges to $\infty$.

We now turn to the upper bound. The key is to adapt the Higher Criticism to correlated
noise and form a new statistic, innovated Higher Criticism.

4 Innovated Higher Criticism, upper bound to detectability

Originally designed for the independent case, standard HC is not really appropriate for
dependent data, for the following reasons. First, HC only summarizes the information that
resides in the marginal effects of each coordinate, and neglects the correlation structure of
the data. Second, HC remains the same if we randomly shuffle different coordinates of $X$.
Such shuffling does not have an effect if $\Sigma_n = I_n$, but does otherwise. In this section we
build the correlation into the standard Higher Criticism and form a new statistic, innovated
Higher Criticism (iHC). We then use iHC to establish an upper bound to detectability. The
iHC is intimately connected to the well-known notion of innovation in time series [6] (see
(4.1) below), hence the name innovated Higher Criticism.

Below, we begin by discussing the role of correlation in the detection problem. 

4.1 Correlation among different coordinates: curse or blessing? 



Consider Model (2.1) in the two cases $\Sigma_n = I_n$ and $\Sigma_n \ne I_n$. Which is the more difficult
detection problem?

Here is one way to look at it. Since the mean vectors are the same in the two cases,
the problem where the noise vector contains more "uncertainty" is more difficult than the
other. In information theory, the total amount of uncertainty is measured by the differential
entropy, which in the Gaussian case is an increasing function of the determinant of the
correlation matrix [11]. As the determinant of a correlation matrix is largest when and only
when it is the identity matrix, the uncorrelated case contains the largest amount of
"uncertainty" and therefore gives the most difficult detection problem. In a sense, the
correlation is a "blessing" rather than a "curse", as one might have expected.

Here is another way to look at it. For any positive definite matrix $\Sigma_n$, denote the
inverse of its Cholesky factorization by $U_n = U_n(\Sigma_n)$ (so that $U_n \Sigma_n U_n' = I_n$). Model (2.1)
is equivalent to

$$U_n X = U_n \mu + U_n Z, \quad \text{where } U_n Z \sim N(0, I_n). \tag{4.1}$$

(In the time series literature [6], $U_n X$ is intimately connected to the notion of innovation.)
Compare this with the uncorrelated case, i.e.

$$X = \mu + Z, \quad \text{where } Z \sim N(0, I_n).$$

The noise vectors have the same distribution, but the signals in the former are stronger. In
fact, let $\ell_1 < \ell_2 < \cdots < \ell_m$ be the $m$ locations where $\mu$ is nonzero. Recalling that $\mu_j = A_n$
if $j \in \{\ell_1, \ell_2, \ldots, \ell_m\}$, $\mu_j = 0$ otherwise, and that $U_n$ is a lower triangular matrix, we have

$$(U_n \mu)_{\ell_k} = A_n \sum_{j=1}^{k} U_n(\ell_k, \ell_j) = A_n U_n(\ell_k, \ell_k) + A_n \Bigl\{ \sum_{j=1}^{k-1} U_n(\ell_k, \ell_j) \Bigr\}. \tag{4.2}$$

Two key observations are as follows. First, since $\Sigma_n$ has unit diagonal entries, every diagonal
entry of $U_n$ is greater than or equal to 1, and especially,

$$U_n(\ell_k, \ell_k) \ge 1. \tag{4.3}$$

Second, recall that $m \ll n$, and $\{\ell_1, \ell_2, \ldots, \ell_m\}$ are randomly generated from $\{1, 2, \ldots, n\}$,
so different $\ell_j$ are far apart from each other. Therefore, under mild decay conditions on $U_n$,

$$U_n(\ell_k, \ell_j) \approx 0, \quad j = 1, 2, \ldots, k - 1. \tag{4.4}$$

Inserting (4.3) and (4.4) into (4.2), we expect that

$$(U_n \mu)_{\ell_k} \gtrsim A_n, \quad k = 1, 2, \ldots, m.$$

Therefore, "on average", $U_n \mu$ has at least $m$ entries each of which is at least as large as
$A_n$. This says that, first, the correlated case is easier for detection than the uncorrelated
case. Second, applying standard HC to $U_n X$ yields a larger power than applying it to $X$
directly.

Next we make the argument more precise. Fix a positive sequence $\{\delta_n : n \ge 1\}$ that
tends to 0 as $n$ diverges to $\infty$, and a sequence of integers $\{b_n : n \ge 1\}$ that satisfy
$1 \le b_n \le n$. Let

$$\tilde\Theta_n^*(\delta_n, b_n) = \Bigl\{ \Sigma_n \in \Theta_n :\ \sum_{j=1}^{k - b_n} |U_n(\Sigma_n)(k, j)| \le \delta_n \ \text{for all } k \text{ satisfying } b_n + 1 \le k \le n \Bigr\}.$$

Introducing $\tilde\Theta_n^*$ seems a digression from our original plan of focusing on $\Theta_n^*$ (the set of
matrices with polynomial off-diagonal decay), but it is interesting in its own right. In fact,
compared to $\Theta_n^*$, $\tilde\Theta_n^*$ is much broader as it does not impose much of a condition on the
entries at positions $(j, k)$ with $|j - k| \le b_n$. This helps to illustrate how broadly the
aforementioned phenomenon holds. The following theorem is proved in Section 10.

Theorem 4.1 Fix $\beta \in (1/2, 1)$ and $r \in (\rho^*(\beta), 1)$. Let $b_n =$ and let $\delta_n$ be a positive
sequence that tends to 0 as $n$ diverges to $\infty$. Suppose we apply standard Higher Criticism
to $U_n(\Sigma_n) X$ and we reject $H_0$ if and only if the resulting score exceeds $1.01\sqrt{2 \log\log n}$.
Then, uniformly in all sequences of $\Sigma_n$ satisfying $\Sigma_n \in \tilde\Theta_n^*(\delta_n, b_n)$,

$$P_{H_0}\{\text{Reject } H_0\} + P_{H_1^{(n)}}\{\text{Accept } H_0\} \to 0, \quad n \to \infty.$$

Generally, directly applying standard HC to $X$ does not yield the same result (e.g. [21]).

4.2 Innovated Higher Criticism: Higher Criticism based on innovations 

We have learned that applying standard HC to $U_n X$ yields better results than applying it
to $X$ directly. Is this the best we can do? No, there is still room for improvement. In fact,
HC applied to $U_n X$ is a special case of innovated Higher Criticism, to be elaborated in this
section. Innovated Higher Criticism is even more powerful in detection.

To begin with, we revisit the vector $U_n \mu$ via an example. Fix $n = 100$; let $\Sigma_n$ be a
symmetric tri-diagonal matrix with 1 on the main diagonal, 0.4 on two sub-diagonals, and
0 elsewhere; and let $\mu$ be the vector with 1 at coordinates 27, 50, 71, and 0 elsewhere. Figure 2
compares $\mu$ and $U_n \mu$. In particular, the nonzero coordinates of $U_n \mu$ appear in three visible
clusters, each of which corresponds to a different nonzero entry of $\mu$. Also, at coordinates
27, 50, 71, $U_n \mu$ approximately equals 1.2, but $\mu$ equals 1.
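The example can be reproduced with a few lines of NumPy. The sketch below assumes that $U_n$ is the inverse of the lower triangular Cholesky factor returned by `numpy.linalg.cholesky`, which is one natural reading of the definition; under a different factorization convention the numerical values would differ.

```python
import numpy as np

n = 100
# Tri-diagonal correlation matrix: 1 on the diagonal, 0.4 on the two sub-diagonals.
sigma = np.eye(n) + 0.4 * (np.eye(n, k=1) + np.eye(n, k=-1))

mu = np.zeros(n)
mu[[26, 49, 70]] = 1.0             # coordinates 27, 50, 71 in 1-based indexing

chol = np.linalg.cholesky(sigma)   # lower triangular, sigma = chol @ chol.T
u = np.linalg.inv(chol)            # U_n: lower triangular, u @ sigma @ u.T = identity
u_mu = u @ mu

# Entries around the first signal: zero before it, a decaying cluster after it.
print(np.round(u_mu[24:32], 3))
```

Because $U_n$ is lower triangular, each cluster starts at a signal location and extends only towards larger indices; the entry at the signal location itself equals the corresponding diagonal entry of $U_n$, which is at least 1.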

Now we can either simply apply standard HC to $U_n X$ as before, or we can first linearly
transform each cluster of signals to a singleton, and then apply the standard HC. Note that
in the second approach, we may have fewer signals, but each of them is much stronger than
those in $U_n X$. Since the HC test is more sensitive to signal strength than to the number
of signals, we expect that the second approach yields greater power for detection than the
first.

Figure 2: Comparison of $\mu$ (left) and $U_n(\Sigma_n)\mu$ (right). Here $n = 100$ and $\Sigma_n$ is a symmetric
tri-diagonal matrix with 1 on the main diagonal, 0.4 on two sub-diagonals, and 0 elsewhere.
Also, $\mu$ is 1 at coordinates 27, 50, and 71 and 0 elsewhere. In comparison, the nonzero
entries of $U_n(\Sigma_n)\mu$ appear in three visible clusters, each of which corresponds to a nonzero
coordinate of $\mu$.



In light of this we propose the following approach. Write $U_n = (u_{kj})_{1 \le k, j \le n}$. We pick
a bandwidth $1 \le b_n \le n$, and construct a matrix $\tilde U_n(b_n) = \tilde U_n(\Sigma_n, b_n)$ by banding $U_n$:
the entries $u_{kj}$ with $0 \le k - j < b_n$ are kept, and all other entries are set to 0.
We then normalize each column of $\tilde U_n(b_n)$ by its own $\ell^2$-norm, and call the resulting matrix
$\bar U_n(b_n)$. Next, defining

$$V_n(b_n) = V_n(b_n, \Sigma_n) = \bar U_n'(b_n; \Sigma_n) \cdot U_n(\Sigma_n), \tag{4.6}$$

we transform Model (2.1) into

$$X \mapsto V_n(b_n) X = V_n(b_n) \mu + V_n(b_n) Z. \tag{4.7}$$

Finally, we apply standard Higher Criticism to $V_n(b_n) X$, and call the resulting statistic
innovated Higher Criticism,

iHC:(6„) = iHC:(6„; S„) = — =L= sup (v^-^l2^M=\. (4.8) 



Note that standard HC applied to $U_n X$ is a special case of $iHC_n^*$ with $b_n = 1$.

We briefly comment on the selection of the bandwidth parameter $b_n$. First, for each
$k \in \{\ell_1, \ell_2, \ldots, \ell_m\}$, direct calculations show that
$(V_n(b_n)\mu)_k \approx A_n \cdot \sqrt{\sum_{j=1}^{b_n} u_{k+j-1,k}^2} \ge A_n$.
Second, $V_n(b_n) Z \sim N(0, \bar U_n'(b_n) \bar U_n(b_n))$, where $\bar U_n'(b_n) \bar U_n(b_n)$ is a banded correlation
matrix with bandwidth $2 b_n - 1$. Therefore, choosing $b_n$ involves a trade-off: a larger $b_n$ usually
means stronger signals, but also means stronger correlation among the noise. While it is
hard to give a general rule for selecting the best $b_n$, we must mention that in many cases,
the choice of $b_n$ is not very critical. For example, when $\Sigma_n$ has polynomial off-diagonal
decay, a logarithmically large $b_n$ is usually appropriate.
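The construction of $V_n(b_n)$ can be sketched as follows, assuming NumPy and the reading of the banding step in which each column of $U_n$ keeps its $b_n$ entries from the diagonal downwards (the reading under which $b_n = 1$ recovers HC applied to $U_n X$). The function names are hypothetical, and `innovated_hc` simply applies the standard HC statistic to $V_n(b_n) X$; any normalizing constant appearing in (4.8) is not reproduced here.

```python
import math
import numpy as np

def higher_criticism(x):
    """Standard HC statistic computed from two-sided normal p-values."""
    n = len(x)
    p = np.sort([math.erfc(abs(v) / math.sqrt(2.0)) for v in x])
    j = np.arange(1, n + 1)
    keep = (p >= 1.0 / n) & (p <= 0.5)
    if not keep.any():
        return -math.inf
    scores = math.sqrt(n) * (j[keep] / n - p[keep]) / np.sqrt(p[keep] * (1.0 - p[keep]))
    return float(scores.max())

def innovated_transform(sigma, b):
    """V_n(b): transpose of the banded, column-normalised U_n, times U_n."""
    n = sigma.shape[0]
    u = np.linalg.inv(np.linalg.cholesky(sigma))     # U_n, with u @ sigma @ u.T = I
    rows, cols = np.indices((n, n))
    band = (rows >= cols) & (rows - cols < b)        # b entries per column, diagonal downwards
    u_band = np.where(band, u, 0.0)
    u_bar = u_band / np.linalg.norm(u_band, axis=0)  # scale each column to unit l2-norm
    return u_bar.T @ u

def innovated_hc(x, sigma, b):
    """HC applied to V_n(b) x (assumption: no extra normalising constant)."""
    return higher_criticism(innovated_transform(sigma, b) @ x)

# Tri-diagonal example: signal strength before and after collapsing the clusters.
n = 100
sigma = np.eye(n) + 0.4 * (np.eye(n, k=1) + np.eye(n, k=-1))
mu = np.zeros(n)
mu[[26, 49, 70]] = 1.0
u = np.linalg.inv(np.linalg.cholesky(sigma))
v = innovated_transform(sigma, 5)
print(round((u @ mu)[26], 3), round((v @ mu)[26], 3))
```

In this example the entry of $V_n(b_n)\mu$ at a signal location equals the $\ell^2$-norm of the retained part of the corresponding column of $U_n$, so it is larger than the entry of $U_n \mu$ at the same location.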

4.3 Upper bound to detectability

We now establish an upper bound to detectability. The following lemma describes the
signal strength in $V_n(b_n) \cdot X$ and is proved in Section 11.

Lemma 4.1 Fix $c_0 > 0$, $\lambda > 1$, and $M > 0$. Consider a sequence of bandwidths $b_n$ that
tends to $\infty$. Let $\{\ell_1, \ell_2, \ldots, \ell_m\}$ be the $m$ random locations of signals in $\mu$, arranged in
ascending order. For sufficiently large $n$, there is a constant $C = C(c_0, \lambda, M)$ such that,
except for an event with asymptotically vanishing probability,

$$(V_n(b_n)\mu)_k \ge \bigl(1 - C b_n^{1/2 - \lambda} + o(1)\bigr) \cdot \sqrt{\Sigma_n^{-1}(k,k)} \cdot A_n, \quad \forall\, k \in \{\ell_1, \ell_2, \ldots, \ell_m\},$$

for all $\Sigma_n \in \Theta_n^*(\lambda, c_0, M)$, where $o(1)$ tends to 0 algebraically fast.

Now, suppose the diagonal entries of $\Sigma_n^{-1}$ have a lower limit as follows,

$$\liminf_{n \to \infty} \Bigl( \min_{\sqrt{n} \le k \le n - \sqrt{n}} \bigl\{ \Sigma_n^{-1}(k,k) \bigr\} \Bigr) = \gamma_0. \tag{4.9}$$

Recall that the nonzero coordinates of $\mu$ are modeled as $A_n = \sqrt{2 r \log n}$. So if we let
$b_n = \log n$, then a direct result of Lemma 4.1 is that the vector $V_n(b_n) \cdot X$ has at least
$m$ nonzero coordinates, each of which is as large as $\sqrt{\gamma_0} A_n = \sqrt{2 \gamma_0 \cdot r \cdot \log n}$. For the
bandwidth, note that a larger $b_n$ cannot improve the signal strength significantly, but may
yield a much stronger correlation in $V_n(b_n) Z$. Therefore, a smaller bandwidth is preferred.
The choice $b_n = \log n$ is mainly for convenience, and can be modified.

We now turn to the behavior of $iHC_n^*(b_n)$ under the null hypothesis. In the independent
case, $iHC_n^*$ reduces to $HC_n^*$ and is approximately equal to $\sqrt{2 \log\log n}$. In the current
situation, $iHC_n^*$ is comparably larger due to the correlation. However, since the selected
bandwidth is relatively small, $iHC_n^*$ remains logarithmically large. This is formally captured
by the following lemma.

Lemma 4.2 Take the bandwidth to be $b_n = \log n$ and suppose $H_0$ is true. Then, except
for an algebraically small probability, $iHC_n^*(b_n) \le C(\log n)^{3/2}$ for some constant $C > 0$,
uniformly for all correlation matrices.

Lemma 4.2 is proved in Section 11. The key is to express $iHC_n^*$ as the maximum of
$(2 b_n - 1)$ standard HC, and apply the well-known Hungarian construction [10]. The following
theorem elaborates on the upper bound, and is proved in Section 10.



Theorem 4.2 Fix $c_0 > 0$, $\lambda > 1$, and $M > 0$, and set $b_n = \log n$. Suppose $\gamma_0 \cdot r > \rho^*(\beta)$.
If we reject $H_0$ when $iHC_n^*(b_n; \Sigma_n) > (\log n)^2$, then, uniformly in all $\Sigma_n \in \Theta_n^*(\lambda, c_0, M)$,

$$P_{H_0}\{\text{Reject } H_0\} + P_{H_1^{(n)}}\{\text{Accept } H_0\} \to 0, \quad \text{as } n \to \infty.$$

The cut-off value $(\log n)^2$ can be replaced by other logarithmically large terms that tend
to $\infty$ faster than $(\log n)^{3/2}$. For finite $n$, this cut-off value may be conservative. In Section
8 (i.e. experiment (a)), we suggest an alternative where we select the cut-off value by
simulation.

In summary, a lower bound and an upper bound are established as $r = \bar\gamma_0^{-1} \rho^*(\beta)$ and
$r = \gamma_0^{-1} \rho^*(\beta)$, respectively, under reasonably weak off-diagonal decay conditions. When
$\bar\gamma_0 = \gamma_0$, the gap between the two bounds disappears, and iHC is optimal for detection. Below,
we investigate several Toeplitz cases, ranging from weak dependence to strong dependence;
for these cases, iHC is optimal in detection.

5 Application in the Toeplitz case

In this section, we discuss the case where $\Sigma_n$ is a (truncated) Toeplitz matrix that is
generated by a spectral density $f$ defined over $(-\pi, \pi)$. In detail, let
$a_k = (2\pi)^{-1} \int_{-\pi}^{\pi} f(\theta) e^{-ik\theta}\, d\theta$ be the $k$-th Fourier coefficient of $f$. The $n$-th truncated
Toeplitz matrix generated by $f$ is the matrix $\Sigma_n(f)$ of which the $(j, k)$-th element is $a_{j-k}$,
for $1 \le j, k \le n$. We assume that $f$ is symmetric and positive, i.e.

$$c_0(f) = \operatorname*{ess\,inf}_{-\pi \le \theta \le \pi} f(\theta) > 0. \tag{5.1}$$

First, note that $f$ is a density, so $a_0 = 1$ and $\Sigma_n(f)$ has unit diagonal entries. Second, from
the symmetry of $f$, it can be seen that $\Sigma_n(f)$ is a real-valued symmetric matrix. Last, it is
well-known [5] that the smallest eigenvalue of $\Sigma_n(f)$ is no smaller than $c_0(f)$, so $\Sigma_n(f)$ is
positive definite. Putting all these together, $\Sigma_n(f)$ is seen to be a correlation matrix.

Toeplitz matrices enjoy convenient asymptotic properties. In detail, suppose that
additionally $f$ has at least $\lambda$ bounded derivatives (interpreted in the sense of conventional
derivatives together with a Hölder condition), where $\lambda > 1$. Then by elementary Fourier
analysis, there is a constant $M_0 = M_0(f) > 0$ such that

$$|a_k| \le M_0(f)(1 + k)^{-\lambda} \quad \text{for } k = 0, 1, 2, \ldots. \tag{5.2}$$

Comparing (5.1) and (5.2) with the definition of $\Theta_n^*$, we conclude that

$$\Sigma_n \in \Theta_n^*(\lambda, c_0(f), M_0(f)). \tag{5.3}$$

In addition, it is known that the inverse of $\Sigma_n(f)$ is typically asymptotically equivalent
to the Toeplitz matrix generated by $1/f$. In particular we have the following lemma, which
is a direct result of [5, Theorem 2.15].

Lemma 5.1 Suppose (5.1) and (5.2) hold. For all $\sqrt{n} \le k \le n - \sqrt{n}$ and each $1 < \lambda' < \lambda$,
$|\Sigma_n^{-1}(f)(k,k) - \Sigma_n(1/f)(k,k)| \le C n^{-(\lambda' - 1)/2}$.

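The diagonal entries of $\Sigma_n(1/f)$ are the well-known Wiener interpolation rates [37]:

$$C(f) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1}{f(\theta)}\, d\theta. \tag{5.4}$$

Therefore, as a direct result of Lemma 5.1,

$$\max_{\sqrt{n} \le k \le n - \sqrt{n}} \bigl\{ |\Sigma_n^{-1}(f)(k,k) - C(f)| \bigr\} = o(1). \tag{5.5}$$

Comparing this with (3.4) and (4.9) we deduce that

$$\bar\gamma_0 = \gamma_0 = C(f).$$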
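The agreement between the interior diagonal entries of $\Sigma_n^{-1}(f)$ and the Wiener interpolation rate can be checked numerically. The sketch below assumes NumPy and uses the hypothetical density $f(\theta) = 1 + 0.8\cos\theta$, whose Fourier coefficients are $a_0 = 1$, $a_{\pm 1} = 0.4$ and zero otherwise, so that $\Sigma_n(f)$ is the tri-diagonal matrix of the earlier example; the integrals are approximated by averages over an equally spaced grid.

```python
import numpy as np

def toeplitz_from_density(f, n, grid=4096):
    """n x n matrix with (j, k) entry a_{j-k}, for a symmetric density f."""
    theta = -np.pi + 2.0 * np.pi * np.arange(grid) / grid
    values = f(theta)
    # For symmetric f the coefficients are real: a_k = average of f(theta) * cos(k * theta).
    coeffs = np.array([np.mean(values * np.cos(k * theta)) for k in range(n)])
    idx = np.arange(n)
    return coeffs[np.abs(idx[:, None] - idx[None, :])]

def wiener_rate(f, grid=4096):
    """C(f): the average of 1 / f over (-pi, pi)."""
    theta = -np.pi + 2.0 * np.pi * np.arange(grid) / grid
    return float(np.mean(1.0 / f(theta)))

def f(theta):
    return 1.0 + 0.8 * np.cos(theta)   # hypothetical spectral density, bounded below by 0.2

n = 100
sigma = toeplitz_from_density(f, n)
c_f = wiener_rate(f)
diag_inv = np.diag(np.linalg.inv(sigma))

# Restrict to the interior indices, away from the two ends of the matrix.
lo, hi = int(np.sqrt(n)), n - int(np.sqrt(n))
print(c_f, float(np.max(np.abs(diag_inv[lo:hi] - c_f))))
```

For this density the rate has the closed form $C(f) = 1/\sqrt{1 - 0.8^2} = 1/0.6$, which exceeds 1, in line with the interpretation that correlation makes detection easier.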



Combining (5.3) and (5.5), the following theorem is a direct result of Theorems 3.2 and 4.1
(the proof is omitted).

Theorem 5.1 Fix $\lambda > 1$, and let $\Sigma_n(f)$ be the Toeplitz matrix generated by a symmetric
spectral density $f$ that satisfies (5.1) and (5.2). When $C(f) \cdot r < \rho^*(\beta)$, the null and
alternative hypotheses merge asymptotically, and the sum of Type I and Type II errors of
any test converges to 1 as $n$ diverges to $\infty$. When $C(f) \cdot r > \rho^*(\beta)$, suppose we apply iHC
with bandwidth $b_n = \log n$ and reject the null hypothesis when $iHC_n^*(b_n, \Sigma_n(f)) > (\log n)^2$.
Then the Type I error of iHC converges to zero, and its power converges to 1.

The curve $r = C(f)^{-1} \rho^*(\beta)$ partitions the $\beta$-$r$ plane into the undetectable region and
the detectable region, similarly to the uncorrelated case. The regions of the current case
can be viewed as the corresponding regions in the uncorrelated case squeezed vertically by a
factor of $1/C(f)$. See Figure 3. (Note that $C(f) \ge 1$, with equality if and only if $f \equiv 1$,
which corresponds to the uncorrelated case.)

Figure 3: Phase diagram in the case where $\Sigma_n$ is a Toeplitz matrix generated by a spectral
density $f$. Similarly to that in Figure 1, the $\beta$-$r$ plane is partitioned into three regions
(undetectable, detectable, estimable), each of which can be viewed as the corresponding
region in Figure 1 squeezed vertically by a factor of $1/C(f)$. In the rectangular region
on the top, the largest signals in $V_n(b_n) \cdot X$ (see (4.6)) are large enough to stand out by
themselves.



6 Extension: when signals appear in clusters 

In the preceding sections (see e.g. (2.3) in Section 2), the $m$ locations of signals were generated randomly from $\{1, 2, \ldots, n\}$. Since $m \ll \sqrt{n}$, the signals appear as singletons with overwhelming probability. In this section we investigate an extension where the signals may appear in clusters.

We consider a setting where the signals appear in a total of $m$ clusters, whose locations are randomly generated from $\{1, 2, \ldots, n\}$. Each cluster contains a total of $K$ consecutive signals, whose strengths are $g_0 A_n, g_1 A_n, \ldots, g_{K-1} A_n$, from right to left. Here, $A_n = \sqrt{2r\log n}$ as before, $K \ge 1$ is a fixed integer, and $g_i$ are constants. Approximately, the signal vector can be modeled as follows.

As before, let $\ell_1, \ell_2, \ldots, \ell_m$ be indices that are randomly sampled from $\{1, 2, \ldots, n\}$. Let $\mu = (\mu_1, \ldots, \mu_n)^T$, where $\mu_j = A_n$ if $j \in \{\ell_1, \ell_2, \ldots, \ell_m\}$, and $\mu_j = 0$ otherwise. Let $B = B_n$ denote the "backward shift" matrix, with 0 in every position except that it has 1 in position $(j+1, j)$ for $1 \le j \le n-1$. Thus, $B\mu$ differs from $\mu$ in that the components are shifted one position backward, with 0 added at the bottom. We model the signal vector as

$$\nu = g_0\mu + g_1 B\mu + \cdots + g_{K-1}B^{K-1}\mu = \Big(\sum_{k=0}^{K-1} g_k B^k\Big)\mu.$$

Thus, $\nu$ is comprised of $m$ clusters, each of which contains $K$ consecutive signals. Let $g$ be the function $g(\theta) = \sum_{0 \le k \le K-1} g_k e^{-ik\theta}$. We note that $\sum_{0 \le k \le K-1} g_k B^k$ is the lower triangular Toeplitz matrix generated by $g$. With the same spectral density $f$, we consider an extension of that in Section 5 by considering the following model:

$$X = \Sigma_n(g)\mu + Z \quad \text{where} \quad Z \sim N(0, \Sigma_n(f)), \qquad (6.1)$$
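The identity between the shift-based construction and the lower triangular Toeplitz matrix can be checked on a small example. The sketch below is illustrative only; the helper names `backward_shift`, `cluster_signal` and `lower_toeplitz` are hypothetical, and the coefficients in `g` are arbitrary test values rather than values from the paper.

```python
import numpy as np


def backward_shift(n):
    """n x n matrix with 1 in positions (j+1, j) and 0 elsewhere."""
    return np.eye(n, k=-1)


def cluster_signal(mu, g):
    """Return sum_k g[k] * B^k mu, built by repeated application of B."""
    B = backward_shift(mu.shape[0])
    nu = np.zeros(mu.shape[0])
    shifted = mu.astype(float)
    for gk in g:
        nu += gk * shifted
        shifted = B @ shifted
    return nu


def lower_toeplitz(n, g):
    """Lower triangular Toeplitz matrix with g[k] on the k-th sub-diagonal."""
    T = np.zeros((n, n))
    for k, gk in enumerate(g):
        T += gk * np.eye(n, k=-k)
    return T


mu = np.zeros(8)
mu[2] = 1.0                      # a single signal location
g = [1.0, 0.5, 0.25]             # arbitrary illustrative coefficients, K = 3
print(cluster_signal(mu, g))     # one cluster of K consecutive nonzero entries
print(np.allclose(cluster_signal(mu, g), lower_toeplitz(8, g) @ mu))
```

Each isolated nonzero entry of $\mu$ is spread into $K$ consecutive entries, which is the cluster structure that Model (6.1) encodes through $\Sigma_n(g)$.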

with $f$ denoting the spectral density in Section 5. The model is closely related to that by Arias-Castro et al., with $g_i = 1$, $m = 1$, and $f \equiv 1$. See details therein.
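We note that the model can be equivalently viewed as

$$\tilde{X} = \mu + \tilde{Z} \quad \text{where} \quad \tilde{Z} \sim N(0, \tilde{\Sigma}_n) \quad \text{and} \quad \tilde{\Sigma}_n = \Sigma_n^{-1}(g) \cdot \Sigma_n(f) \cdot \Sigma_n^{-1}(\bar{g}),$$

with $\bar{g}$ denoting the complex conjugate of $g$. Asymptotically, $\tilde{\Sigma}_n^{-1} \sim \Sigma_n(|g|^2/f)$, where the diagonal entries of $\Sigma_n(|g|^2/f)$ are

$$C(f,g) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{|g(\theta)|^2}{f(\theta)}\, d\theta.$$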



If $\bar{\gamma}_0$ and $\underline{\gamma}_0$ are as defined in (3.4) and (4.9), then $\bar{\gamma}_0 = \underline{\gamma}_0 = C(f,g)$, and we expect the detection boundary to be $r = C(f,g)^{-1} \cdot \rho^*(\beta)$. This is affirmed by the following theorem, which is proved in Section 10.



Theorem 6.1 Fix $\lambda > 1$. Suppose $g_0 \ne 0$ and let $f$ be a symmetric spectral density that satisfies (5.1) and (5.2). When $C(f,g) \cdot r < \rho^*(\beta)$, the null and alternative hypotheses merge asymptotically, and the sum of Type I and Type II errors of any test converges to 1 as $n$ diverges to $\infty$. When $C(f,g) \cdot r > \rho^*(\beta)$, if we apply iHC to $\Sigma_n^{-1}(g)X$ with bandwidth $b_n = \log n$ and reject the null hypothesis when $iHC_n^*(b_n, \Sigma_n^{-1}(g) \cdot \Sigma_n(f) \cdot \Sigma_n^{-1}(\bar{g})) > (\log n)^2$, then the Type I error converges to zero, and the power converges to 1.



7 The case of strong dependence 

So far, we have only discussed weakly dependent cases. In this section, we investigate the 
case of strong dependence. 

Suppose that we observe an $n$-variate Gaussian vector $X = \mu + Z$, where $\mu$ contains a total of $m$ signals, of equal strength to be specified, whose locations are randomly drawn from $\{1, 2, \ldots, n\}$ without replacement, and $Z \sim N(0, \Sigma_n)$, where we assume that $\Sigma_n$ displays slowly decaying correlation:

$$\Sigma_n(j,k) = \max\{0, 1 - |j-k|^{\alpha} n^{-\alpha_0}\}, \quad 1 \le j,k \le n, \qquad (7.1)$$

with $\alpha > 0$ and $0 < \alpha_0 \le \alpha$. The range of dependence can be calibrated in terms of $k_0 = k_0(n; \alpha, \alpha_0)$, denoting the largest integer $k$ such that $k < n^{\alpha_0/\alpha}$. Clearly, $k_0 \sim n^{\alpha_0/\alpha}$. Seemingly, the most interesting range is $0 < \alpha_0/\alpha < 1$. The following lemma establishes cases for which $\Sigma_n$ is a correlation matrix.



Lemma 7.1 Let $\Sigma_n$ be as in (7.1). For sufficiently large $n$, a necessary and a sufficient condition for $\Sigma_n$ to be positive definite are, respectively, $0 < \alpha < 2$ and $0 < \alpha_0 < \alpha < 1$.



Lemma 7.1 is proved in Section 12. Model (7.1) has been studied in detail by Hall and Jin [21], who showed that the detectability of standard HC is seriously damaged by strong dependence. However, it remains open as to what the detection boundary is, and how to adapt HC to overcome the strong dependence and obtain optimal detection. This is what we address in the current section.

The key idea is to decompose the correlation matrix as the product of three matrices 
each of which is relatively easy to handle. To begin with we introduce a spectral density, 

$$f_\alpha(\theta) = 1 - \sum_{k=1}^{\infty} \big[(k+1)^{\alpha} + (k-1)^{\alpha} - 2k^{\alpha}\big] \cos(k\theta). \qquad (7.2)$$
Figure 4: Display of $C(f_\alpha, g_0)$. x-axis: $\alpha$. y-axis: $C(f_\alpha, g_0)$.



Note that the Fourier coefficients of $f_\alpha(\theta)$ satisfy the decay condition in (5.2) with $\lambda = 2 - \alpha$. Also, we have the following lemma, which is derived in Section 12.

Lemma 7.2 For $0 < \alpha < 1$, we have $\operatorname{essinf}_{-\pi \le \theta \le \pi}\{f_\alpha(\theta)\} > 0$.
Next, let

$$g_0(\theta) = 1 - e^{-i\theta}, \qquad a_n = a_n(\alpha_0) = n^{\alpha_0}/2.$$

The Toeplitz matrix $\Sigma_n(g_0)$ is a lower triangular matrix with 1's on the main diagonal, $-1$'s on the sub-diagonal, and 0's elsewhere. Additionally, let $D_n$ be the diagonal matrix where on the diagonal the first entry is 1 and the remaining entries are $\sqrt{a_n}$. Let $\tilde{X} = D_n \cdot \Sigma_n(g_0) \cdot X$.
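The effect of this differencing and rescaling can be inspected numerically. The sketch below is an illustration under the definitions just given: it builds $\Sigma_n$ from (7.1), forms $D_n \Sigma_n(g_0) \Sigma_n \Sigma_n'(g_0) D_n$, and lets one check that the result has unit diagonal and that its entries away from the first row and column depend only on the lag. The function names and the parameter values $n = 50$, $\alpha = 0.8$, $\alpha_0 = 0.4$ are hypothetical choices for the example.

```python
import numpy as np


def strong_dependence_cov(n, alpha, alpha0):
    """Sigma_n(j, k) = max{0, 1 - |j - k|^alpha * n^(-alpha0)} as in (7.1)."""
    idx = np.arange(n)
    lag = np.abs(idx[:, None] - idx[None, :]).astype(float)
    return np.maximum(0.0, 1.0 - lag ** alpha * float(n) ** (-alpha0))


def differenced_cov(n, alpha, alpha0):
    """Covariance of D_n Sigma_n(g0) Z for Z ~ N(0, Sigma_n)."""
    sigma = strong_dependence_cov(n, alpha, alpha0)
    t_g0 = np.eye(n) - np.eye(n, k=-1)   # 1 on the diagonal, -1 on the sub-diagonal
    a_n = float(n) ** alpha0 / 2.0
    d = np.full(n, np.sqrt(a_n))
    d[0] = 1.0                           # first diagonal entry of D_n is 1
    m = d[:, None] * t_g0                # D_n @ Sigma_n(g0)
    return m @ sigma @ m.T


S = differenced_cov(50, 0.8, 0.4)
print(np.allclose(np.diag(S), 1.0))      # unit diagonal after rescaling
print(np.isclose(S[5, 6], S[20, 21]))    # lag-1 entries agree in the interior
```

The choice $a_n = n^{\alpha_0}/2$ is what normalises the variance of each first difference to 1, since $\mathrm{Var}(X_j - X_{j-1}) = 2n^{-\alpha_0}$ under (7.1).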



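Then Model (7.1) can be rewritten equivalently as

$$\tilde{X} = \tilde{\mu} + \tilde{Z} \quad \text{where} \quad \tilde{\mu} = D_n \cdot \Sigma_n(g_0) \cdot \mu \quad \text{and} \quad \tilde{Z} \sim N(0, \tilde{\Sigma}_n), \qquad (7.3)$$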



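with $\tilde{\Sigma}_n = D_n \cdot \Sigma_n(g_0) \cdot \Sigma_n \cdot \Sigma_n'(g_0) \cdot D_n$. The key is that $\tilde{\Sigma}_n$ is asymptotically equivalent to the Toeplitz matrix generated by $f_\alpha$. In detail, introduce

$$\bar{\Sigma}_n = \begin{pmatrix} 1 & 0 \\ 0 & \Sigma_{n-1}(f_\alpha) \end{pmatrix}.$$

The following lemma is proved in Section 11.

Lemma 7.3 The spectral norm of $\tilde{\Sigma}_n - \bar{\Sigma}_n$ tends to 0 as $n$ tends to $\infty$.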

Additionally, note that $\tilde{\mu} = \sqrt{a_n} \cdot \Sigma_n(g_0) \cdot \mu$ except for the first coordinate. Therefore, we expect Model (7.3) to be approximately equivalent to

$$\tilde{X} = \sqrt{a_n} \cdot \Sigma_n(g_0) \cdot \mu + Z \quad \text{where} \quad Z \sim N(0, \Sigma_n(f_\alpha)).$$

This is a special case of the cluster model we considered in Section 6 with $f = f_\alpha$ and $g = g_0$, except that the signal strength has been re-scaled by $\sqrt{a_n}$. Therefore, if we calibrate the nonzero entries in $\mu$ as

$$a_n^{-1/2}\sqrt{2r\log n}, \qquad (7.4)$$
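then the detection boundary for the model is succinctly characterized by

$$C(f_\alpha, g_0) \cdot r = \rho^*(\beta), \qquad C(f_\alpha, g_0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{|g_0(\theta)|^2}{f_\alpha(\theta)}\, d\theta = \frac{1}{\pi}\int_{-\pi}^{\pi} \frac{1 - \cos(\theta)}{f_\alpha(\theta)}\, d\theta.$$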



See Figure 4 for the display of $C(f_\alpha, g_0)$. The following theorem is proved in Section 10.

Theorem 7.1 Let $0 < \alpha_0 < \alpha < 1$, $\beta \in (\frac{1}{2}, 1)$, and $r \in (0, 1)$. Assume $X$ is generated according to Model (7.1), with signal strength re-scaled as in (7.4). When $C(f_\alpha, g_0) \cdot r < \rho^*(\beta)$, the null and alternative hypotheses merge asymptotically, and the sum of Type I and Type II errors of any test converges to 1 as $n$ diverges to $\infty$. When $C(f_\alpha, g_0) \cdot r > \rho^*(\beta)$, if we apply the iHC to $X$ with bandwidth $b_n = \log n$ and reject the null when $iHC_n^*(b_n, \Sigma_n) > (\log n)^2$, then the Type I error converges to zero, and its power converges to 1.




Figure 5: Sum of Type I and Type II errors as described in experiment (a). From top to bottom then from left to right, $(\beta, r) = (0.5, 0.2), (0.5, 0.25), (0.55, 0.2), (0.55, 0.25)$. In each panel, the x-axis displays $\rho$, and three curves (blue, dashed-green, and red) display the sum of errors corresponding to HC, HC-a, and HC-b.



8 Simulation study 

We conducted a small-scale empirical study to compare the performance of iHC and standard HC. For iHC, we investigate two choices of bandwidth: $b_n = 1$ and $b_n = \log n$. In this section, we denote standard HC, iHC with $b_n = 1$, and iHC with $b_n = \log n$ by HC, HC-a, and HC-b, respectively.

The algorithm for generating data included the following four steps: (1) Fix $n$, $\beta$, and $r$, let $m = n^{1-\beta}$ and $A_n = \sqrt{2r\log n}$. (2) Given a correlation matrix $\Sigma_n$, generate a Gaussian vector $Z \sim N(0, \Sigma_n)$. (3) Randomly draw $m$ integers $\ell_1 < \ell_2 < \cdots < \ell_m$ from $\{1, 2, \ldots, n\}$ without replacement, and let $\mu$ be the $n$-vector such that $\mu_j = A_n$ if $j \in \{\ell_1, \ell_2, \ldots, \ell_m\}$ and 0 otherwise. (4) Let $X = \mu + Z$. Using data generated in this manner we explored three parameter settings, (a)-(c), which we now describe.
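Steps (1)-(4) can be sketched as follows. This is a minimal illustration rather than the code used for the study: the function names `calibrate` and `generate_data` are hypothetical, rounding $m = n^{1-\beta}$ to the nearest integer is an assumption, and NumPy's multivariate normal sampler stands in for whatever sampler was actually used.

```python
import numpy as np


def calibrate(n, beta, r):
    """Step (1): number of signals m and signal strength A_n."""
    m = int(round(n ** (1.0 - beta)))       # rounding rule is an assumption
    a_n = float(np.sqrt(2.0 * r * np.log(n)))
    return m, a_n


def generate_data(n, beta, r, sigma, rng):
    """Steps (2)-(4): return (X, Z, mu) with X = mu + Z and Z ~ N(0, sigma)."""
    m, a_n = calibrate(n, beta, r)
    z = rng.multivariate_normal(np.zeros(n), sigma)
    locations = np.sort(rng.choice(n, size=m, replace=False))
    mu = np.zeros(n)
    mu[locations] = a_n
    return mu + z, z, mu


rng = np.random.default_rng(0)
x, z, mu = generate_data(100, 0.5, 0.2, np.eye(100), rng)
print(calibrate(1000, 0.5, 0.2))            # m and A_n for n = 1000
print(int(np.count_nonzero(mu)))            # number of planted signals
```

Returning both `x` and `z` mirrors the study design, in which each procedure is applied to the pure-noise vector $Z$ (null) and to $X$ (alternative).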
In experiment (a), we took $n = 1000$ and $\Sigma_n$ as the tri-diagonal Toeplitz matrix generated by $f(\theta) = 1 + 2\rho\cos(\theta)$, $|\rho| < 1/2$. The corresponding detection boundary was $r = \rho^*(\beta)/C(f)$ with

$$C(f) = (2\pi)^{-1}\int_{-\pi}^{\pi} \frac{1}{1 + 2\rho\cos(\theta)}\, d\theta.$$

Consider all $\rho$ that range from $-0.45$ to $0.45$ with an increment of $0.05$, and four pairs of parameters $(\beta, r) = (0.5, 0.2)$, $(0.5, 0.25)$, $(0.55, 0.2)$, and $(0.55, 0.25)$. (Note that the corresponding parameters $(m, A_n)$ are $(32, 1.66)$, $(32, 2.63)$, $(22, 1.66)$, and $(22, 2.63)$.) For each triple $(\beta, r, \rho)$, we generated data according to (1)-(4), applied HC, HC-a, and HC-b to both $Z$ and $X$, and repeated the whole process independently 100 times. As a result, for each triple $(\beta, r, \rho)$ and each procedure, we got 100 HC scores that corresponded to the null hypothesis, and 100 HC scores that corresponded to the alternative hypothesis.
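For the tri-diagonal case the constant $C(f)$ is easy to evaluate. The sketch below is an illustration with hypothetical helper names: it approximates the integral by an average over a uniform grid, and compares it with the elementary closed form $1/\sqrt{1 - 4\rho^2}$ (a standard calculus identity, not stated in the text) and with an interior diagonal entry of $\Sigma_n^{-1}$.

```python
import numpy as np


def wiener_constant(f, num=4096):
    """Approximate (1 / 2 pi) * integral over [-pi, pi] of 1 / f(theta)."""
    theta = -np.pi + 2.0 * np.pi * np.arange(num) / num
    return float(np.mean(1.0 / f(theta)))


def tridiagonal_toeplitz(n, rho):
    """Toeplitz matrix generated by f(theta) = 1 + 2 rho cos(theta)."""
    return np.eye(n) + rho * (np.eye(n, k=1) + np.eye(n, k=-1))


rho = 0.4
c_numeric = wiener_constant(lambda t: 1.0 + 2.0 * rho * np.cos(t))
c_closed = 1.0 / np.sqrt(1.0 - 4.0 * rho ** 2)
c_matrix = np.linalg.inv(tridiagonal_toeplitz(200, rho))[100, 100]
print(c_numeric, c_closed, c_matrix)
```

The three quantities agree closely, which illustrates why the interior diagonal entries of $\Sigma_n^{-1}$ govern the detection boundary, and why a larger $|\rho|$ (hence a larger $C(f)$) lowers the boundary $r = \rho^*(\beta)/C(f)$.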

We report the results in two different ways. First, we report the minimum sum of Type I and Type II errors (i.e. the minimum of the sum across all possible cut-off values); see Figure 5. Second, we pick the upper 10% percentile of the 100 HC scores corresponding to the null hypothesis as a threshold (for later reference, we call this threshold the empirical threshold), and calculate the empirical power of the test (i.e. the fraction of HC scores corresponding to the alternative hypothesis that exceed the threshold). The empirical thresholds are displayed in Table 1 (to save space, only part of the thresholds are reported), and the power is displayed in Figure 6. Recall that in Theorem 4.2 we recommend $(\log n)^2$ as a cut-off point in the asymptotic setting. For moderately large $n$, this cut-off point is conservative, and we recommend the empirical threshold instead.
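Both summaries can be computed directly from the two sets of scores. The following sketch is illustrative: the function names are hypothetical, the test is taken to reject when a score exceeds the cut-off, and the "upper 10% percentile" is read as the 0.9 quantile with NumPy's default interpolation, which is an assumption about a detail the text leaves open.

```python
import numpy as np


def min_error_sum(null_scores, alt_scores):
    """Minimum over cut-offs of (Type I error + Type II error)."""
    null_scores = np.asarray(null_scores, dtype=float)
    alt_scores = np.asarray(alt_scores, dtype=float)
    pooled = np.unique(np.concatenate((null_scores, alt_scores)))
    cutoffs = np.concatenate(([-np.inf], pooled))
    best = 2.0
    for c in cutoffs:
        type1 = np.mean(null_scores > c)     # false rejections under the null
        type2 = np.mean(alt_scores <= c)     # missed detections under the alternative
        best = min(best, float(type1 + type2))
    return best


def empirical_power(null_scores, alt_scores, level=0.10):
    """Empirical threshold (upper `level` quantile of null scores) and power."""
    threshold = float(np.quantile(null_scores, 1.0 - level))
    power = float(np.mean(np.asarray(alt_scores, dtype=float) > threshold))
    return threshold, power


print(min_error_sum([0, 1, 2, 3], [2.5, 4, 5, 6]))
print(empirical_power(np.arange(100), [50, 95, 100, 89.5]))
```

Only the observed scores need to be tried as cut-offs, because the two empirical error rates change only at those values.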

The results suggest that (1) HC-b outperforms HC-a, and HC-a outperforms HC. (2) As $|\rho|$ increases (note that a larger $|\rho|$ means a stronger correlation), the detection problem becomes increasingly easy, and the advantage of iHC becomes increasingly prominent. (3) Under the null hypothesis, the HC-b scores are usually smaller than those of HC and HC-a. This is mainly due to the normalization term $\sqrt{2b_n - 1}$ in the definition of iHC (see (4.8)).




Figure 6: Power as described in experiment (a). From top to bottom then from left to right, $(\beta, r) = (0.5, 0.2), (0.5, 0.25), (0.55, 0.2), (0.55, 0.25)$. In each panel, the x-axis displays $\rho$, and three curves (blue, dashed-green, and red) display the power of HC, HC-a, and HC-b.

ρ      -0.45  -0.35  -0.25  -0.15  -0.05   0.05   0.15   0.25   0.35   0.45
HC     3.078  2.681  2.849  2.843  2.577  2.613  2.968  2.659  3.078  3.072
HC-a   2.637  2.889  2.759  2.806  2.689  2.657  3.083  2.788  2.679  2.670
HC-b   0.973  0.810  0.771  0.805  0.716  0.752  0.817  0.764  0.819  0.938

In experiment (b), we took $\Sigma_n$ to be the Toeplitz matrix generated by $f(\theta) = 1 + 2\cos(\theta) + 2\rho\cos(2\theta)$, where $\rho$ ranged from $-0.2$ to $0.45$ with an increment of $0.05$ (the matrix $\Sigma_n$ is positive definite when $\rho$ is in this range). Other parameters are the same as in experiment (a). The minimum sums of Type I and Type II errors are reported in Figure 7. The results suggest similarly that HC-b outperforms HC-a, and HC-a outperforms HC.

Figure 7: Sum of Type I and Type II errors as described in experiment (b). From top to bottom then from left to right, $(\beta, r) = (0.5, 0.2), (0.5, 0.25), (0.55, 0.2), (0.55, 0.25)$. In each panel, the x-axis displays $\rho$, and three curves (blue, dashed-green, and red) display the sum of errors corresponding to HC, HC-a, and HC-b.

In experiment (c), we investigated the behavior of HC-a/HC-b/HC for larger $n$. We took $(\beta, r) = (0.5, 0.25)$, $n = 500 \times (1, 2, 3, 4, 5)$, and $\Sigma_n$ as the tri-diagonal matrix in experiment (a) with $\rho = 0.4$. The sum of Type I and Type II errors is reported in Table 2. The results suggest that the performance of HC-a/HC-b/HC improves when $n$ gets larger. (Investigation of the case where $n$ was much larger than 2500 needed much greater computer memory, and so we omitted it.)



n        500     1000    1500    2000    2500
HC       .130    .150    .090    .115    .085
HC-a     .040    .030    .015    .025    .015
HC-b     .025    .010    .005    .005

Table 2: Display of the sum of Type I and Type II errors in experiment (c) for different n.
9 Discussion



We have extended standard HC to innovated HC by building in the correlation structure. The extreme diagonal entries of $\Sigma_n^{-1}$ play a key role in the testing problem. If the extreme value has finite upper and lower limits, $\bar{\gamma}_0$ and $\underline{\gamma}_0$, then in the $\beta$-$r$ plane, the detection boundary is bounded by the curves $r = \underline{\gamma}_0^{-1} \cdot \rho^*(\beta)$ from above and $r = \bar{\gamma}_0^{-1} \cdot \rho^*(\beta)$ from below. When the correlation matrix is Toeplitz, the upper and lower limits merge and equal the Wiener interpolation rate $C(f)$. The detection boundary is therefore $r = (C(f))^{-1} \cdot \rho^*(\beta)$. The detection boundary partitions the $\beta$-$r$ plane into a detectable region and an undetectable region. Innovated HC has asymptotically full power for detection whenever $(\beta, r)$ falls into the interior of the detectable region, and is therefore optimally adaptive.

9.1 Connection to recent literature 

The work complements that of Donoho and Jin [14] and Hall and Jin [21]. The focus of [14] is standard HC and its performance in the uncorrelated case. The focus of [21] is how strong dependence may harm the effectiveness of standard HC; what could be a remedy was however not explored. The innovated HC proposed in the current paper is optimal for both the model in [14] and that in [21].

The work is related to that of Jager and Wellner, where the authors proposed a family of goodness-of-fit statistics for detecting sparse normal mixtures. The work is also related to that of Meinshausen and Rice [32] and of Cai, Jin and Low [7], where the authors focused on how to estimate $\epsilon_n$, the proportion of non-null effects.

Recently, HC was also found to be useful for feature selection in high dimensional classification. See Donoho and Jin [15, 16] and Hall et al. [20]. The work concerned the situation where there are relatively few samples containing a very large number of features, out of which only a small fraction is useful, and each useful feature contributes weakly to the classification problem. In a related setting, Delaigle and Hall [13] investigated HC for classification when the data is non-Gaussian or dependent.

9.2 Future work 

The work is also intimately connected to recent literature on estimating covariance matri- 
ces. While the study is focused more on situations where the correlation matrices can be 
estimated using other approaches (e.g. [9] [T71 HH]), it can be generalized to cases where 
the correlation matrix is unknown but can be estimated from data. In particular, it is 
noteworthy that it was shown in Bickel and Levina [1] that when the correlation matrix has 
polynomial off-diagonal decay, the matrix and its inverse can be estimated accurately in 
terms of the spectral norm. In such situations we expect the proposed approach to perform 
well once we combine it with that in [1]. 

Another interesting direction is to explore cases where the correlation matrix does not 
have polynomial off-diagonal decay, but is sparse in an unspecified pattern. This is a more 
challenging situation as relatively little is known about the inverse of the correlation matrix. 

Our study also opens opportunities for improving other recent procedures. Take the aforementioned work on classification [15, 16, 20] for example. The approach derived in this paper suggests ways of incorporating correlation structure into feature selection, and therefore raises hopes for better classifiers. For reasons of space, we leave explorations along these directions to future study.
10 Proofs of main results



In this section we prove all theorems in preceding sections, except Theorems 2.1 and 5.1. These two theorems are the direct result of Theorems 3.2 and 4.2, so we omit the proofs. For simplicity, we drop the subscript $n$ whenever there is no confusion.

10.1 Proof of Theorem 3.1



Rewrite the second model in (3.1) as $X + \xi = \mu + \xi + Z$, where independently, $Z \sim N(0, \Sigma)$, $\xi \sim N(0, \Delta)$, $\mu \sim G$ for $\Delta = \Sigma^* - \Sigma$ and some distribution $G$. It suffices to show the monotonicity in the Hellinger affinity. Denote the density function of $N(0, \Sigma)$ by $f(x) = f(x_1, x_2, \ldots, x_n)$, and write $dx_1 dx_2 \cdots dx_n$ as $dx$ for short. Then the Hellinger affinity corresponding to the second model in (3.1) is

$$h(\Sigma, \Delta, G) = \int \sqrt{E_\Delta E_G f(x - \mu - \xi) \cdot E_\Delta f(x - \xi)}\, dx.$$

By Hölder's inequality and Fubini's theorem, $h(\Sigma, \Delta, G)$ is not less than

$$\int E_\Delta\Big[\sqrt{E_G f(x - \mu - \xi) f(x - \xi)}\Big]\, dx = E_\Delta\Big[\int \sqrt{E_G f(x - \mu - \xi) f(x - \xi)}\, dx\Big].$$

Note that $\int \sqrt{E_G f(x - \mu - \xi) f(x - \xi)}\, dx = \int \sqrt{E_G f(x - \mu) f(x)}\, dx$ for any fixed $\xi$. It follows that

$$h(\Sigma, \Delta, G) \ge E_\Delta\Big[\int \sqrt{E_G f(x - \mu - \xi) f(x - \xi)}\, dx\Big] = \int \sqrt{E_G f(x - \mu) f(x)}\, dx,$$

where the last term is the Hellinger affinity corresponding to the first model of (3.1). Combining these results gives the claim. □

10.2 Proof of Theorem 3.2

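It is sufficient to show that the Hellinger distance between the joint density of $X$ and $Z$ converges to 0 as $n$ diverges to $\infty$. By the assumption $\bar{\gamma}_0 r < \rho^*(\beta)$, we can choose a sufficiently small constant $\delta = \delta(r, \beta, \bar{\gamma}_0)$ such that $\bar{\gamma}_0 (1 - \delta)^{-1} r < \rho^*(\beta)$. Let $\tilde{\mu} = \mu/\sqrt{1 - \delta}$, let $U$ be the inverse of the Cholesky factorization of $\Sigma$, and let $\tilde{U}$ be the banded version of $U$:

$$\tilde{U}(i,j) = \begin{cases} U(i,j), & |i - j| \le \log^2(n), \\ 0, & \text{otherwise}. \end{cases}$$

Model (2.1) can be equivalently written as

$$\tilde{X} = \tilde{\mu} + \tilde{Z} \quad \text{where} \quad \tilde{Z} \sim N(0, (1 - \delta)^{-1} \cdot \Sigma). \qquad (10.1)$$

The key to the proof is to compare Model (10.1) with the following model:

$$X^* = \tilde{\mu} + Z^* \quad \text{where} \quad Z^* \sim N(0, (\tilde{U}'\tilde{U})^{-1}). \qquad (10.2)$$

In fact, by Theorem 3.1, to establish the claim it suffices to prove that (i) $(\tilde{U}'\tilde{U})^{-1} \le (1 - \delta)^{-1}\Sigma$ for sufficiently large $n$, and (ii) the Hellinger distance between the joint density of $X^*$ and that of $Z^*$ associated with Model (10.2) tends to 0 as $n$ diverges to $\infty$.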
To prove the first claim, noting that $\Sigma = (U'U)^{-1}$, it suffices to show $(1 - \delta)U'U \le \tilde{U}'\tilde{U}$. Define $W = U - \tilde{U}$ and observe that there is a generic constant $C > 0$ such that $\|U\| \le C$ and $\|\tilde{U}\| \le C$, whence $\|U'U - \tilde{U}'\tilde{U}\| = \|W'W + \tilde{U}'W + W'\tilde{U}\| \le C\|W\|$. Moreover, by

|22l Theorem 5.6.9], for any symmetric matrix, the spectral norm is no greater than the 
£"'^-norm. In view of the definitions of W and 05!i(A, cq, M), the ^"'^-norm of W is no greater 
than (logn)-2(^-i). Therefore, \\U'U - U'U\\ < C\\W\\ < C(log n)-2(-^-i). This, and the 
fact that all eigenvalues of U'U are bounded from below by a positive constant, imply the 
claim. 



We now consider the second claim. Model (10.2) can be equivalently written as $X = \tilde{U}\tilde{\mu} + Z$ where $Z \sim N(0, I_n)$. The key to the proof is that $\tilde{U}$ is a banded matrix and $\tilde{\mu}$ is a sparse vector where, with probability converging to 1, the inter-distances of nonzero coordinates are no less than $3(\log n)^2$ (see Lemma 11.2 for the proof). As a result the nonzero coordinates of $\tilde{U}\tilde{\mu}$ are disjoint clusters of sizes $O(\log^2 n)$, which simplifies the calculation of the Hellinger distance. The derivation of the claim is lengthy, so we summarize it in the following lemma, which is proved in Section 11.

Lemma 10.1 Fix $\beta \in (\frac{1}{2}, 1)$, $r \in (0, 1)$, and $\delta \in (0, 1)$ such that $\bar{\gamma}_0 (1 - \delta)^{-1} r < \rho^*(\beta)$. As $n$ tends to $\infty$ the Hellinger distance associated with Model (10.2) tends to 0.



10.3 Proof of Theorem 4.1



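Put $Y = U_n(\Sigma_n)X$, $\nu = U_n(\Sigma_n)\mu$, and $\tilde{Z} = U_n(\Sigma_n)Z$. Model (4.1) reduces to

$$Y = \nu + \tilde{Z}, \qquad \tilde{Z} \sim N(0, I_n). \qquad (10.3)$$

Recalling that $HC_n^*/\sqrt{2\log\log n} \to 1$ in probability under $H_0$, it follows that $P_{H_0}\{\text{Reject } H_0\}$ tends to 0 as $n$ diverges to $\infty$, and it suffices to show $P_{H_1^{(n)}}\{\text{Accept } H_0\} \to 0$.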



The key to the proof is to compare Model (10.3) with

$$Y^* = \nu^* + Z \quad \text{where} \quad Z \sim N(0, I_n), \qquad (10.4)$$

with $\nu^*$ having $m$ nonzero entries of equal strength $(1 - \delta_n)A_n$ whose locations are randomly drawn from $\{1, 2, \ldots, n\}$ without replacement. By (4.2)-(4.3) and the way $\Theta_n^*(\delta_n, b_n)$ is defined, we note that $\nu_j \ge (1 - \delta_n)A_n$ for all $j \in \{\ell_1, \ell_2, \ldots, \ell_m\}$. Therefore,

signals in $\nu$ are both denser and stronger than those in $\nu^*$. (10.5)

Intuitively, standard HC applied to Model (10.3) is no "less" than that applied to Model (10.4).



We now establish this point. Let $\bar{F}_0(t)$ be the survival function of the central $\chi^2$-distribution $\chi_1^2(0)$, and let $\bar{F}_n(t)$ and $\bar{F}_n^*(t)$ be the empirical survival functions of $\{Y_k^2\}_{k=1}^n$ and $\{(Y_k^*)^2\}_{k=1}^n$, respectively. Using arguments similar to those of Donoho and Jin [14] it can be shown that standard HC applied to Models (10.3) and (10.4), denoted by $HC_n^{(1)}$ and $HC_n^{(2)}$ for short, can be rewritten as

$$\sup_{\{t:\, 1/n \le \bar{F}_0(t) \le 1/2\}} \left\{ \frac{\sqrt{n}(\bar{F}_n(t) - \bar{F}_0(t))}{\sqrt{\bar{F}_0(t)(1 - \bar{F}_0(t))}} \right\} \quad \text{and} \quad \sup_{\{t:\, 1/n \le \bar{F}_0(t) \le 1/2\}} \left\{ \frac{\sqrt{n}(\bar{F}_n^*(t) - \bar{F}_0(t))}{\sqrt{\bar{F}_0(t)(1 - \bar{F}_0(t))}} \right\},$$

respectively. The key fact is now that the family of non-central $\chi^2$-distributions $\{\chi_1^2(\delta), \delta \ge 0\}$ is a monotone likelihood ratio family (MLR), so that for any fixed $x$ and $\delta_2 \ge \delta_1 \ge 0$, $P\{\chi_1^2(\delta_2) \ge x\} \ge P\{\chi_1^2(\delta_1) \ge x\}$. Consequently, it follows from (10.5) and mathematical induction that for any $x$ and $t$, $P\{\bar{F}_n(t) \ge x\} \ge P\{\bar{F}_n^*(t) \ge x\}$. Therefore, for any fixed $x > 0$,

$$P\{HC_n^{(1)} < x\} \le P\{HC_n^{(2)} < x\}. \qquad (10.6)$$

Finally, by an argument similar to that of Donoho and Jin [14, Section 5.1], the second term in (10.6) with $x = 1.01 \cdot \sqrt{2\log\log n}$ tends to 0 as $n$ diverges to $\infty$. This implies the claim. □
□ 



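10.4 Proof of Theorem 4.2

In view of Lemma 4.2, it suffices to show that $P_{H_1^{(n)}}\{\text{Accept } H_0\} \to 0$. Put $\tilde{U} = \tilde{U}(b_n)$, $V = V_n(b_n)$, $Y = VX$, $\nu = V\mu$, $\tilde{Z} = VZ$. Model (4.7) reduces to

$$Y = \nu + \tilde{Z} \quad \text{where} \quad \tilde{Z} \sim N(0, \tilde{U}'\tilde{U}). \qquad (10.7)$$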



Let Fn{t) and Fo{t) be the empirical survival function of {Y^}'^^^ and the survival func- 
tion of Xi(0)) respectively. Let q = q{P,r) = min{(/3 -|- 7or)^/(47or), 47or} and set i* = 
\/2qlogn. Since 7or < p*{P), then it can be shown that < g < 1 and < ^o(C) ^1/2 
for sufficiently large n. Using an argument similar to that in the proof of Theorem |4.1[ 



iHCl 



sup 



n{Fn{s) - Fois)) 



> 



n(F„(C) - Fo(C)) 



{s: l/n<F„(.)<l/2} V i'^K - l)Fo{s){l - Fo{s)) ^ C^b^ - 1) Fo{tl) {1 - Fo{t*jy 

and it follows that 



P{iHCl < log3/2(n)} < P{ 



V(26„-l)Fo(t*)(l-Fo(t?;)) 



< log3/2(n)}. 



(10. 



It remains to show that the right hand side of (10.8) is algebraically small. The proof needs detailed calculation which we summarize in the lemma below, the proof of which is given in Section 11.

Lemma 10.2 Under the condition of Theorem 4.2, the right hand side of (10.8) tends to 0 algebraically fast as $n$ diverges to $\infty$.

10.5 Proof of Theorem 6.1

Inspection of the proof of Theorems 3.2 and 4.2 reveals that the condition that $\Sigma_n$ is a correlation matrix and that $\Sigma_n \in \Theta_n^*(\lambda, c_0, M)$ in those theorems can be relaxed. In particular, $\Sigma_n$ need not have equal diagonal entries and the decay condition on $\Sigma_n$ can be replaced by a weaker condition that concerns the decay of $U_n$ (the inverse of the Cholesky factorization of $\Sigma_n$), specifically

$$|U_n(i,j)| \le M(1 + |i - j|^{\lambda})^{-1}.$$

Let $U_n(f)$ be the inverse of the Cholesky factorization of $\Sigma_n(f)$, and define $\tilde{U}_n = U_n(f)\Sigma_n(g)$. Since $\Sigma_n(g)$ is a lower triangular matrix with positive diagonal entries, it is seen that $\tilde{U}_n$ is the inverse of the Cholesky factorization of $\tilde{\Sigma}_n$. By Lemma 3.2, $U_n(f)$ has polynomial off-diagonal decay with the parameter $\lambda$. It follows that $\tilde{U}_n$ decays at the same rate. Applying Theorems 3.2 and 4.2, we see that all that remains to prove is that

$$\max_{\{\sqrt{n} \le k \le n - \sqrt{n}\}} \{ |\tilde{\Sigma}_n^{-1}(k,k) - C(f,g)| \} \to 0. \qquad (10.9)$$



By [5l Theorem 2.15], for any ^/n < k < n — y/n, k — K<j<k + K, and 1 < A' < A, 

- (S„(l//))(fc,j)| = o(n-(i-^')/2). 

Since S"^ = ^n{g) ■ ^n^if) ■ ^n{g), it follows that sup|^<fc<„_^| \t-'^{k,k) - (SnCff) ■ 
S„(l//) • T,n{g)){k, k)\ 0. Moreover, direct calculations show that (Sn(5) • S„(l//) ■ 
S„(g))(A;,A:) = C{f,g), ^ < k < n- 
concludes the proofs. 



n. Combining these results gives (10.9) and 

□ 



10.6 Proof of Theorem 7.1

Consider the first claim. It suffices to show that the Hellinger distance between $\tilde{X}$ and $\tilde{Z}$ in Model (7.3) tends to 0 as $n$ diverges to $\infty$. Since $C(f_\alpha, g_0) \cdot r < \rho^*(\beta)$, there is a small constant $\delta > 0$ such that $(1 - \delta)^{-1} \cdot C(f_\alpha, g_0) \cdot r < \rho^*(\beta)$. Using Lemma 7.2 we see that $\bar{\Sigma}_n$ is a positive matrix the smallest eigenvalue of which is bounded away from 0. It follows from Lemma 7.3 and basic algebra that $\tilde{\Sigma}_n \ge (1 - \delta)\bar{\Sigma}_n$ for sufficiently large $n$. Compare Model (7.3) with

$$X^* = \tilde{\mu} + Z^* \quad \text{where} \quad Z^* \sim N(0, (1 - \delta)\bar{\Sigma}_n). \qquad (10.10)$$

By the monotonicity of Hellinger distance (Theorem 3.1), it suffices to show that the Hellinger distance between $X^*$ and $Z^*$ tends to 0 as $n$ diverges to $\infty$.

Now, by the definition of $\tilde{\mu}$, $\tilde{\mu} - \sqrt{a_n} \cdot \Sigma_n(g_0) \cdot \mu = ((1 - \sqrt{a_n})\mu_1, 0, \ldots, 0)'$. Since $P\{\mu_1 \ne 0\} = o(1)$ then, except for an event with negligible probability, $\tilde{\mu} = \sqrt{a_n} \cdot \Sigma_n(g_0) \cdot \mu$. Therefore, replacing $\tilde{\mu}$ by $\sqrt{a_n} \cdot \Sigma_n(g_0) \cdot \mu$ in Model (10.10) alters the Hellinger distance only negligibly.

Note that the first coordinate of $X^*$ is uncorrelated with all other coordinates, and its mean equals 0 with probability converging to 1, so removing it from the model only has a negligible effect on the Hellinger distance. Combining these properties, Model (10.10) reduces to the following with only a negligible difference in the Hellinger distance:

$$X^*(2:n) = \Sigma_{n-1}(g_0)(\sqrt{a_n} \cdot \mu(2:n)) + Z^*(2:n), \qquad Z^*(2:n) \sim N(0, (1 - \delta)\Sigma_{n-1}(f_\alpha)),$$

where $X(2:n)$ denotes the vector $X$ with the first entry removed. Dividing both sides by $\sqrt{1 - \delta}$, this reduces to the following model:

$$\tilde{X}(2:n) = \Sigma_{n-1}(g_0)\left(\frac{\sqrt{a_n} \cdot \mu(2:n)}{\sqrt{1 - \delta}}\right) + \tilde{Z}(2:n), \qquad \tilde{Z}(2:n) \sim N(0, \Sigma_{n-1}(f_\alpha)), \qquad (10.11)$$

which is in fact Model (6.1) considered in Section 6. It follows from (7.4) that $\sqrt{a_n} \cdot \mu(2:n)/\sqrt{1 - \delta}$ has $m$ nonzero coordinates each of which equals $\sqrt{2(1 - \delta)^{-1} r \log n}$. Comparing Model (10.11) with Model (6.1) and recalling that $(1 - \delta)^{-1} \cdot r \cdot C(f_\alpha, g_0) < \rho^*(\beta)$, the claim follows from Theorem 6.1.

Consider the second claim. Since $C(f_\alpha, g_0) \cdot r > \rho^*(\beta)$, there is a small constant $\delta > 0$ such that $(1 - \delta) \cdot r \cdot C(f_\alpha, g_0) > \rho^*(\beta)$. Let $U_n$ be the inverse of the Cholesky factorization of $\Sigma_n$, and let $\tilde{U}_n(b_n)$ and $V_n(b_n)$ be as defined right below (4.5). Write Model (7.1) equivalently as

$$VX = V\mu + VZ \quad \text{where} \quad VZ \sim N(0, \tilde{U}'(b_n)\tilde{U}(b_n)).$$

Recall that $\tilde{U}'(b_n)\tilde{U}(b_n)$ is a banded correlation matrix with bandwidth $2b_n - 1$. Let $\ell_1, \ell_2, \ldots, \ell_m$ be the $m$ locations of nonzero means of $\mu$. By an argument similar to that in the proof of Theorem 4.2, all that remains to show is that, except for an event with negligible probability,

$$(V\mu)_k \ge \sqrt{2r'\log n} \quad \text{for some constant } r' > \rho^*(\beta) \text{ and all } k \in \{\ell_1, \ell_2, \ldots, \ell_m\}. \qquad (10.12)$$



We now show ( 10.12 ). First, by Lemma 4.1 and (7.4 1, except for an event with negligible 
probability, 

{Vl,)k > (1 - 5)'/' • (a„ • k)r^/^ • An, k e {h,i2, . . .,im}- 

Second, by the way S„ is defined, 

{anT.~^){k, k) = (i;„(5o) • ■ ^nigo)){k, k), for all k>2, 
and by the way S„ is defined and Lemma 



7.3 



for sufficiently large n. 



> (1 - and so S„(5o)S-iS„(5o) > (1 - 6)'/^J:n{9)K^^n{9). 

Last, by [3| Theorem 2.15], \{T.n{go) ■'^^^ ■T.n{go))ik, k)-C(Ja, go)\ = o(l) when min{A;, n- 
k} is sufficiently large. Combining these results gives ( 10.12 ) with r' = (1 — 6) -r •C{fa, go), 
and the claim follows directly. □ 



11 Appendix 



This section contains proofs for all lemmas in preceding sections, except Lemmas 3.1 5.1 



7.1, and 7.2 Lemma 3.1 is the direct result of 



[5J Theorem 2.15], so we omit the proofs. Lemmas 7.1 and 7.2 are proved in Section 12 



and Lemma 5.1 is the direct result of 



11.1 Proof of Lemma 3.2

Consider the first claim. Construct an infinite matrix Sqo by arranging the finite matrices 
along the diagonal, and note that the inverses of Sqo is the matrix formed by arranging the 



inverse of the finite matrices along the diagonal. Since Soo(i, j) < M(l + \i — j 



X\-l 



then 



applying Lemma 3.1 gives the claim. 



Consider the second claim. It suffices to show that \Un{k,j)\ < C/(l + \ k 



^) for all 

1 < j < k < n. Denote the first k x k main diagonal sub-matrix of S„ by S^, the k-th row 
of Sfc by 1), and the k-th row of [/„ by n'^. It follows from direct calculations that 



4 = (1 - • i^k-i^k-v !)• 

At the same time, by ( |11.1[ ) and basic algebra, 

(l < u'kUk = S^^(fc,A;). 



(11.1) 



(11.2) 
Combining (11.1) and (11.2) gives

\Un{k,j)\ = \ukij)\<C\{^,:\^k^l)j\ 



l<j<k-l. 



(11.3) 



Now, by Lemma 3.1, |S^i^(j,s)| < C(l + \ j - s|^)~^ for all 1 < i,i < - 1. Note that 
|^fc„i(s)| < C(l + |s — k\^)^^, 1 < s < n, and A > 1. It follows from basic algebra that 



(11.4) 

□ 



c c 



Inserting (11.4) into (11.3) gives the claim. 
11.2 Proof of Lemma 4.1



Without loss of generality, assume £i < £2 < ■ ■ ■ < ^m- By Lemma |11.2[ except for an 
event with negligible probability, ii > bn, £m ^ n — bn, and the inter-distances of the ij^s 
> Clogn • n2/3-i. For any k E {^1,^2, • • • ,^m}, let 4 = (Ej^fc""' ^\)"'^'- By the way 
tJ{bn) is defined. 



s,j=l 

Consider dk first. Write 



UksUsjfJ-j - ^ {Uks - Uks)UsjlJ'j 
s,j=l s,j=l 



(11.5) 



k+b„-l 
j=k 



2 _ \ " 2 \^ 2 

'^jk - Z^'^jk- Uj^. 



j=k 



j=k-b„ 



First, [/'[/ = S-i, Y.']=ku]k = {U'U){k,k) = (S-i)(fc,A:). Second, by the polynomial 
off-diagonal decay of U and basic calculus. 



n n 



j=k+b„ 



j=k+bn 



Last, note that the quantities S ^(/c, k) are uniformly bounded away from and cxd. Com- 
bining these results gives 

14 - V^-Hk,k)\ < Cbi~'\ (11.6) 

Consider j^iUksUsjfJ-j next. Recall that /Zj = An when j G {ii,i2, ■ ■ ■ ,^m} and 
/ij = otherwise. Since U'U = S^^, 



UksUsjUj = '^{T, ^){k,j)^j = An'E ^{k,k) + An'^T. ^{kjs)- 

s,j=l j=l £s^k 



Define L„ = n^~^/^. By Lemma 11.2 except for an event with negligible probability, the 
inter-distance of ij is no less than L„. So by the polynomial off-diagonal decay of the 
second term is algebraically small. Therefore, 



UksUsjfJ.j = An[{^-^){k,k) + 0(6^-^)]. 



:ii.7) 
Last, we consider $\sum_{s,j=1}^{n} (U_{ks} - \tilde{U}_{ks})U_{sj}\mu_j$. Direct calculations show that



c 



so by a similar argument, 



i+|fc-j|^' 



A > 1, 
A = l, 



{Uks - Uks)Usjflj 

s,j=l 



Y,{{u-uyu){k,j)f,, 



<An-{{u-uyu){k,k) + o{i), 

where o(l) is algebraically small. Moreover, by the Cauchy-Schwartz inequality, 

n 

{{U - UyU){k, k) < ^\{uks - Uks)usk\ < bl/^-\ 

s=l 

and the claim follows. □ 
11.3 Proof of Lemma 4.2

Without loss of generality, suppose n is divisible by 2bn — 1, and let N = N{n,bn) = 
n/{2bn — 1). Let pi be iid samples from C/(0,1), and F]y{t) be the empirical cdf. The 
normalized uniform stochastic process is defined as 



W7v(t) = VN[FN{t) - 



The following lemma is proved in Section |11.3.1[ 

Lemma 11.1 There is a generic constant C > such that for sufficiently large n, 
P{ sup |Wjv(t)| > C(logn)3/2} < Cn~^. 

{l/n<t<l/2} 



We now prove Lemma 4.2. Define $Y = \tilde{U}'UX$. Under the null hypothesis, $Y \sim N(0, \tilde{U}'\tilde{U})$ and the coordinates $Y_k$ are block-wise dependent with a bandwidth $\le 2b_n - 1$. Split the $Y_k$'s into $2b_n - 1$ different subsets $\Omega_j = \{Y_k : k \equiv j \bmod (2b_n - 1)\}$, $1 \le j \le 2b_n - 1$. Note that the $Y_k$'s in each subset are independent, and that $|\Omega_j| = N$, $1 \le j \le 2b_n - 1$.



Let Fn{t) and FQ^t) be as in the proof of Theorem 4.1 and let 
2b„-l 



Note that = 25~r X^i=i ^ ^njit). By arguments similar to that of Donoho and Jin 

|14j . and basic algebra, it follows that 

/S(^„(t)-W) 1/^' 



1 < J < 26„ - 1. 



k=l 



iHC* = sup', , 

t I V(26„ - l)Fo(t)Fo(t) 



iV(F„,,(t)-Fo(t)) ] 



and so for any x > 0, 



P{iHC* >x}< 



2b„-l . . 



NjFnjit) - Fo{t)) 
^/Fo{t)Fo{t) 
Finally, since the $\bar F_{n,j}$ are the empirical survival functions of $N$ independent samples
from $\chi_1^2(0)$, we have



sup 



N{Fn,j{t) - Foit)) 



l/n<Fo(i)<l/2 L 

Therefore, 



sup {WAr(t)} in distribution. 

{l/n<t<l/2} 



$$P\{\mathrm{iHC}^* \ge x\} \le (2b_n - 1)\, P\Big\{\sup_{\{1/n \le t \le 1/2\}} W_N(t) \ge x\Big\}.$$

Taking $x = C(\log n)^{3/2}$, the claim follows from Lemma 11.1. □



11.3.1 Proof of Lemma 11.1

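By the Hungarian construction [10], there is a Brownian bridge $M(t)$ such that

$$P\Big\{\sup_{\{1/n \le t \le 1/2\}} \big|\sqrt{N}(F_N(t) - t) - M(t)\big| \ge \frac{C(\log N + x)}{\sqrt{N}}\Big\} \le Ce^{-Cx},$$

where $C > 0$ are generic constants. Noting that $1/\sqrt{t(1-t)} \le C\sqrt{n} \le C\sqrt{N \log n}$ when
$1/n \le t \le 1/2$, it follows that

$$P\Big\{\sup_{\{1/n \le t \le 1/2\}} \Big|\frac{\sqrt{N}(F_N(t) - t) - M(t)}{\sqrt{t(1-t)}}\Big| \ge C(\log N)^{1/2}(\log N + x)\Big\} \le Ce^{-Cx}. \qquad (11.8)$$

At the same time, by [33, Page 446],

$$P\Big\{\sup_{\{1/n \le t \le 1/2\}} \Big|\frac{M(t)}{\sqrt{t(1-t)}}\Big| \ge C(\log N)^{1/2}x\Big\} \le C\log N \cdot e^{-Cx}. \qquad (11.9)$$

Combining (11.8)-(11.9), taking $x = C\log N$ and using the triangle inequality gives the claim.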

□ 



11.4 Proof of Lemma 7.3

By direct calculations and the way S is defined, we have 



1 



^11.10) 



where 



eLi = X (0, . . . , - k^{nr, koinr - (koin) - 1)", . . . , 2° - 1, 1) (11.11) 

and S* is a symmetric matrix with unit diagonal entries, and with the following on the 
k-th sub-diagonal: 

{2A;°-(fc + l)"-(fc-l)", A;<A;o(n)-l, 

1 + ((/fc - 1)" - 2A;")/n"o = 0(n-"o/°), k = ko{n), 

-(1 -{k- l)°/n"o) = 0(n-°o/"), k = ko{n) + 1, 

0, A;>A;o(n) + 2. 






Note that S„_i((7o) and S* share the same 2A;o(n) — 1 sub-diagonals that are closest to the 
main diagonal (including the main diagonal). Let Hi be the matrix containing all other 
sub-diagonals of S„_i((7o), and let H2 be the the matrix which contains the A;o(^)-th and 
the {kQ{n) + l)-th diagonals (upper and lower) of S*. It is seen that 



S - S 



Hi 




+ 



H2 




+ 





c-1 







= Bi + B2 + B3. 



Let II • 111 and || • II2 denote the £^ matrix norm and the £^ matrix norm, respectively. First, 
by direct calculations, since a < 1/2, \\Bi + B2\\i < Cn"o(°-^)/° < Cn""". At the same 
time, by (11.11) and since a < 1/2, 

C 



LB, 



< — y (A;" - (A; + 1)")2 < 

k=l k=l 



c 



Since the spectral norm is no greater than the ^^-matrix norm and the ^^-matrix norm, the 
spectral norm of Bi + B2 + B^ is no greater than Cn"""/^, and the claim follows. □ 



11.5 Proof of Lemma 10.1



Let $a = \sqrt{(1-\delta)/\gamma_0}$, $r' = \gamma_0(1-\delta)^{-1}r$, $U_1 = aU$, and $\tilde\mu = \mu/a$. Model (10.2) can be
equivalently written as

$$X = U\mu + Z = U_1\tilde\mu + Z, \qquad \text{where } Z \sim N(0, I_n). \qquad (11.12)$$

Using the argument in the first paragraph of the proof of Theorem 3.2, it is not hard to verify
that (I) $\tilde\mu$ has $m = n^{1-\beta}$ nonzero entries each of which is equal to $\sqrt{2r'\log n}$ with $r' < \rho^*(\beta)$,
and whose locations are randomly sampled from $(1, 2, \ldots, n)$; (II) $U_1$, where $U_1(k,j) = 0$ if
$|k - j| > (\log n)^2$, is a banded lower triangular matrix; and (III) $\lim_{n\to\infty} \max_{\{\sqrt{n} \le k \le n - \sqrt{n}\}}
\{(U_1'U_1)(k,k)\} = (1-\delta) < 1$.

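From now on, write $\mu = \tilde\mu$ and $r = r'$ for short. Note that the Hellinger affinity
associated with Model (10.2) is $E_0(\sqrt{W_n^*})$, where $E_0$ denotes the law $Z \sim N(0, I_n)$, and

$$W_n^* = W_n^*(r, \beta; Z_1, Z_2, \ldots, Z_n) = \binom{n}{m}^{-1} \sum_{\{\ell = (\ell_1, \ell_2, \ldots, \ell_m)\}} e^{\mu'U_1'Z - \|U_1\mu\|^2/2}.$$

Introduce the set of indices

$$S_n = \Big\{\ell = (\ell_1, \ell_2, \ldots, \ell_m):\ \min_{\{1 \le j \le m-1\}} |\ell_{j+1} - \ell_j| \ge 3(\log n)^2,\ \ell_1 \ge \sqrt{n},\ n - \ell_m \ge \sqrt{n}\Big\}. \qquad (11.13)$$

The following lemma is proved in Section 11.5.1.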



Lemma 11.2 Let $\ell_1 < \ell_2 < \cdots < \ell_m$ be $m$ distinct indices randomly sampled from
$(1, 2, \ldots, n)$ without replacement. Then for any $1 \le K \le n$, (a) $P\{\ell_1 \le K\} \le Km/n$,
(b) $P\{\ell_m \ge n - K\} \le Km/n$, and (c) $P\{\min_{\{1 \le j \le m-1\}}\{|\ell_{j+1} - \ell_j|\} \le K\} \le Km(m+1)/n$.
As a result, $P\{\ell = (\ell_1, \ell_2, \ldots, \ell_m) \notin S_n\} = O((\log n)^2 n^{1-2\beta}) = o(1)$.
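Parts (a) and (c) of Lemma 11.2 can be checked exactly for small parameter values. The sketch below is not part of the proof; it assumes the uniform model of the lemma (all $m$-subsets of $\{1, \ldots, n\}$ equally likely) and uses two elementary counting facts that are assumptions of this illustration: the number of $m$-subsets avoiding $\{1, \ldots, K\}$ is $\binom{n-K}{m}$, and the number of $m$-subsets whose consecutive gaps all exceed $K$ is $\binom{n-(m-1)K}{m}$. The function names and parameter values are hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_first_index_at_most(n, m, K):
    """Exact P{l_1 <= K}: complement of 'no index among 1..K is chosen'."""
    return 1 - Fraction(comb(n - K, m), comb(n, m))


def prob_min_gap_at_most(n, m, K):
    """Exact P{min_j (l_{j+1} - l_j) <= K}.

    Subsets whose gaps all exceed K are counted by shifting the j-th
    smallest index down by (j-1)*K, giving C(n - (m-1)K, m) subsets.
    """
    free = n - (m - 1) * K
    well_separated = comb(free, m) if free >= m else 0
    return 1 - Fraction(well_separated, comb(n, m))


# Illustrative values only.
n, m, K = 1000, 5, 10
p_a = prob_first_index_at_most(n, m, K)
p_c = prob_min_gap_at_most(n, m, K)
bound_a = Fraction(K * m, n)            # bound in part (a)
bound_c = Fraction(K * m * (m + 1), n)  # bound in part (c)

print(float(p_a), float(bound_a))
print(float(p_c), float(bound_c))
```

Exact rational arithmetic is used so that the comparison with the bounds does not depend on floating-point rounding.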



Applying Lemma 11.2, we make only a negligible difference by restricting $\ell$ to $S_n$ and
defining

$$W_n = \binom{n}{m}^{-1} \sum_{\{\ell = (\ell_1, \ell_2, \ldots, \ell_m) \in S_n\}} e^{\mu'U_1'Z - \|U_1\mu\|^2/2}, \qquad (11.14)$$
in which case

$$E(W_n^{*1/2}) = E(W_n^{1/2}) + o(1). \qquad (11.15)$$

Define $Y = U_1'Z$,

$$\sigma_j^2 = \mathrm{var}(Y_j) = (U_1'U_1)(j,j), \qquad 1 \le j \le n, \qquad (11.16)$$

and the event

$$D_n = \{Y_j/\sigma_j \le \sqrt{2\log n},\ 1 \le j \le n\}.$$

By direct calculation, $P\{D_n^c\} = o(1)$, and so by Hölder's inequality, $E(W_n^{1/2}\, 1\{D_n\}) =
E(W_n^{1/2}) + o(1)$. Combining this result and (11.15) we deduce that $E(W_n^{*1/2}) = E(W_n^{1/2}\, 1\{D_n\}) +
o(1)$, and comparing this property with the desired result we see that it is sufficient to
show that

$$E(W_n^{1/2}\, 1\{D_n\}) = 1 + o(1). \qquad (11.17)$$

The key to (11.17) is the following lemma, which is proved in Section 11.5.2.



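Lemma 11.3 Consider Model (11.12), where $U_1$ and $\mu$ satisfy (I)-(III). As $n \to \infty$,

$$E(W_n\, 1\{D_n\}) = 1 + o(1), \qquad \text{and} \qquad E(W_n^2\, 1\{D_n\}) = 1 + o(1).$$

Since

$$|W_n^{1/2}\, 1\{D_n\} - 1| = \frac{|W_n\, 1\{D_n\} - 1|}{1 + W_n^{1/2}\, 1\{D_n\}} \le |W_n\, 1\{D_n\} - 1|,$$

then by Hölder's inequality,

$$\big(E|W_n^{1/2}\, 1\{D_n\} - 1|\big)^2 \le E|W_n\, 1\{D_n\} - 1|^2 = E[W_n^2\, 1\{D_n\}] - 2E(W_n\, 1\{D_n\}) + 1. \qquad (11.18)$$

Combining (11.18) with Lemma 11.3 gives (11.17). □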



11.5.1 Proof of Lemma 11.2

The last claim follows once (a)-(c) are proved. Consider (a)-(b) first. Fixing $K \ge 1$,

$$P\{\ell_1 = K\} = \binom{n-K}{m-1}\Big/\binom{n}{m} = \frac{m\,(n-m)(n-m-1)\cdots(n-m-K+2)}{n(n-1)\cdots(n-K+1)} \le m/n,$$

so $P\{\ell_1 \le K\} \le Km/n$. Similarly, $P\{n - \ell_m \le K\} \le Km/n$. This gives (a) and (b).
Next we prove (c). Denote the minimum inter-distance of $\ell_1, \ell_2, \ldots, \ell_m$ by

$$L(\ell) = L(\ell; m, n) = \min_{\{1 \le j \le m-1\}}\{|\ell_{j+1} - \ell_j|\}.$$

Note that

$$P\{L(\ell) = K\} \le \sum_{j=1}^{m-1} P\{\ell_{j+1} - \ell_j = K\} \le \sum_{j=1}^{m-1}\sum_{k=1}^{n} P\{\ell_j = k,\ \ell_{j+1} = k + K\}.$$

Writing $P\{\ell_j = k,\ \ell_{j+1} = k + K\} = \binom{n}{m}^{-1}\binom{k-1}{j-1}\binom{n-k-K}{m-j-1}$, we have:



n k 



j=l k=j 

where the last term is no greater than 

'n-K -1 
m — 2 

•f A;=i ~ 

and the claim follows. 



1 /k-l\/n- k- K 



1 " 



< 



n 



\ m / 



n 

m — 2 



□ 
11.5.2 Proof of Lemma 11.3

Define $T_n = \sqrt{2\log n}$. We need the following lemma, which is proved in Section 12.



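Lemma 11.4 Consider a bivariate zero mean normal variable $(X, Y)'$ that satisfies $\mathrm{Var}(X) =
\sigma_1^2$, $\mathrm{Var}(Y) = \sigma_2^2$, and $\mathrm{corr}(X, Y) = \rho$, where $c_0 \le \sigma_1, \sigma_2 \le 1$ for some constant $c_0 \in (0, 1)$.
Then there is a constant $C > 0$ such that for sufficiently large $n$,

$$E[\exp(A_nX - \sigma_1^2A_n^2/2) \cdot 1\{Y/\sigma_2 > T_n\}] \le C \cdot n^{-(1 - \sigma_1\rho\sqrt{r})^2} \le Cn^{-(1-\sqrt{r})^2},$$

$$E\Big[\exp\Big(A_n(X + Y) - \frac{\sigma_1^2 + \sigma_2^2}{2}A_n^2\Big) \cdot 1\{X/\sigma_1 \le T_n,\ Y/\sigma_2 \le T_n\}\Big] \le Cn^{d(r)},$$

where $d(r) = \min\{2r,\ 1 - 2(1 - \sqrt{r})^2\}$.
We also need the following definition.

Definition 11.1 We say that two indices $j$ and $k$ are near to each other if $|j - k| \le (\log n)^2$.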



We now proceed to show Lemma 11.3. Consider the first claim. Note that for any
$\ell = (\ell_1, \ell_2, \ldots, \ell_m) \in S_n$, the minimum inter-distance of the $\ell_j$ is no less than $3(\log n)^2$. In view
of the definition of $Y_j$ and $\sigma_j$ (see (11.16)), we have

$$\|U_1\mu\|^2 = A_n^2\sum_{i=1}^{m}(U_1'U_1)(\ell_i, \ell_i) = A_n^2\sum_{i=1}^{m}\sigma_{\ell_i}^2.$$



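Consequently, we can rewrite $W_n$ as

$$W_n = \binom{n}{m}^{-1} \sum_{\{\ell = (\ell_1, \ell_2, \ldots, \ell_m) \in S_n\}} \exp\Big(A_n\sum_{j=1}^{m}Y_{\ell_j} - \frac{A_n^2}{2}\sum_{j=1}^{m}\sigma_{\ell_j}^2\Big). \qquad (11.19)$$

Note that

$$1\{D_n^c\} \le \sum_{k=1}^{n} 1\{Y_k/\sigma_k > T_n\}. \qquad (11.20)$$

Combining (11.19) and (11.20) gives

$$E[W_n \cdot 1\{D_n^c\}] \le \binom{n}{m}^{-1} \sum_{\{\ell = (\ell_1, \ldots, \ell_m) \in S_n\}}\ \sum_{k=1}^{n} E\Big[\exp\Big(A_n\sum_{j=1}^{m}Y_{\ell_j} - \frac{A_n^2}{2}\sum_{j=1}^{m}\sigma_{\ell_j}^2\Big) \cdot 1\{Y_k/\sigma_k > T_n\}\Big]. \qquad (11.21)$$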

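Now, for each $1 \le k \le n$, when $k$ is near one $\ell_j$, say $\ell_{j_0}$, $Y_k$ must be independent of all
other $Y_{\ell_j}$ with $j \ne j_0$. It follows that

$$E\Big[\exp\Big(A_n\sum_{j=1}^{m}Y_{\ell_j} - \frac{A_n^2}{2}\sum_{j=1}^{m}\sigma_{\ell_j}^2\Big) \cdot 1\{Y_k/\sigma_k > T_n\}\Big] = E\big[\exp(A_nY_{\ell_{j_0}} - \sigma_{\ell_{j_0}}^2A_n^2/2) \cdot 1\{Y_k/\sigma_k > T_n\}\big].$$

By Lemma 11.4, the right hand side is no greater than $Cn^{-(1-\sqrt{r})^2}$. Therefore,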



E 



exp ( E - ^ E 4 ) • l{n/<Xfe>T„ 
.7=1 ^ 



^11.22) 
Moreover, for each fixed $\ell = (\ell_1, \ldots, \ell_m) \in S_n$, there are at most $2m(\log n)^2$ different indices
$k$ that can be near some of the $\ell_j$'s; and when they are, they can be near only one such $\ell_j$.
Combining these results gives

$$E[W_n \cdot 1\{D_n^c\}] \le \binom{n}{m}^{-1} \sum_{\{\ell = (\ell_1, \ldots, \ell_m) \in S_n\}} C(\log n)^2\, m\, n^{-(1-\sqrt{r})^2} \le C(\log n)^2\, n^{(1-\beta) - (1-\sqrt{r})^2}. \qquad (11.23)$$

By the definition of $\rho^*(\beta)$ and the assumption of the lemma, $r < \rho^*(\beta) \le (1 - \sqrt{1-\beta})^2$,
and so the first claim follows directly from (11.23).

We now consider the second claim. Fix $0 \le N \le m$, and let $S_N(\ell)$ denote the set of
all $k = (k_1, k_2, \ldots, k_m) \in S_n$ such that there are exactly $N$ $k_j$'s that are near to one $\ell_i$.
(Clearly, any $k_j$ can be near to at most one $\ell_i$.) The two sets of indices $(\ell_1, \ell_2, \ldots, \ell_m)$ and
$(k_1, k_2, \ldots, k_m)$ form exactly $N$ pairs, each of which contains one candidate from the first set and
one candidate from the second. These pairs are not near to each other, and not near to any
remaining indices outside the pairs. Using (11.19), we write

i2 



X E 



E E E 

exp Y.^Y,^ + ^'^«) - -f E(4 + 4) • 



^11.24) 



For any fixed $\ell$ and $k \in S_N(\ell)$, by symmetry, and without loss of generality, we suppose
the pairs are $(\ell_1, k_1), (\ell_2, k_2), \ldots, (\ell_N, k_N)$. By independence of the pairs with other
indices, and also by independence among the pairs,



E 



exp ( A„ + ^^.) - -f J2H + <) ) • 1 



< E 



< E 



exp ( An Y^i^ij + Yk,) - -f + ^fc.) ) • l{>^.,M,<T„,y,,M,<T„, for all 1 < j < A^} 

V 7 = 1 7 = 1 / 3 J J J 



N 



exp aJ B^^. +^^.)-^EH + -: 



n 



N 



E 



exp \ An{Yi^ + 



l{^«j/% <r„,yfc^M^.<T„,for all 1 < i < A^} 

(11.25) 



Here, in the first inequality, we have used the fact that

$$1\{D_n\} \le 1\{Y_{\ell_j}/\sigma_{\ell_j} \le T_n,\ Y_{k_j}/\sigma_{k_j} \le T_n, \text{ for all } 1 \le j \le N\};$$

in the second inequality, we have utilized the independence and the fact that

$$E[\exp(A_nY_j - \sigma_j^2A_n^2/2)] = 1, \qquad \text{for all } j = 1, \ldots, n;$$

and in the third equality, we have used again the independence. Moreover, in view of the
way $U_1$ is defined and Lemma 3.2, there is a constant $c_0 \in (0, 1)$ such that $\sigma_j \in [c_0, 1]$.
Using Lemma 11.4, for sufficiently large $n$ and each $1 \le j \le N$,

$$E\Big[\exp\Big(A_n(Y_{\ell_j} + Y_{k_j}) - \frac{A_n^2}{2}(\sigma_{\ell_j}^2 + \sigma_{k_j}^2)\Big) \cdot 1\{Y_{\ell_j}/\sigma_{\ell_j} \le T_n,\ Y_{k_j}/\sigma_{k_j} \le T_n\}\Big] \le Cn^{d(r)}, \qquad (11.26)$$

with $d(r)$ being as in Lemma 11.4. Combining (11.25) and (11.26) gives



E[Wn ■ 1{D„}] 



< 



0—{0. \ AT—n 



'''Y\SN{e)\. 



(11.27) 



where $|S_N(\ell)|$ denotes the cardinality of $S_N(\ell)$. By elementary combinatorics,



\SN{i)\ < 



2 



[2 log^ n 



m — N 



2 „\Af 



< (2 log^ n 



m 



n 



N J \m- N 



;il.28) 



Direct calculations show that

$$\frac{\binom{m}{N}\binom{n}{m-N}}{\binom{n}{m}} = \frac{1}{N!}\Big(\frac{m!}{(m-N)!}\Big)^2\frac{(n-m)!}{(n-m+N)!} \le \frac{1}{N!}\Big(\frac{m^2}{n-m}\Big)^N. \qquad (11.29)$$
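The binomial ratio appearing in (11.29) can be verified exactly for small values. The following sketch is an illustration only: it checks the factorial identity $\binom{m}{N}\binom{n}{m-N}/\binom{n}{m} = \frac{1}{N!}\big(\frac{m!}{(m-N)!}\big)^2\frac{(n-m)!}{(n-m+N)!}$ and the elementary upper bound $\frac{1}{N!}\big(\frac{m^2}{n-m}\big)^N$, which is assumed here as one valid form of the final bound. Function names and the values of `n` and `m` are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def binomial_ratio(n, m, N):
    """C(m, N) * C(n, m - N) / C(n, m) as an exact fraction."""
    return Fraction(comb(m, N) * comb(n, m - N), comb(n, m))


def factorial_form(n, m, N):
    """(1/N!) * (m!/(m-N)!)^2 * (n-m)!/(n-m+N)! as an exact fraction."""
    falling = factorial(m) // factorial(m - N)          # m (m-1) ... (m-N+1)
    rising = factorial(n - m + N) // factorial(n - m)   # (n-m+1) ... (n-m+N)
    return Fraction(falling * falling, factorial(N) * rising)


def upper_bound(n, m, N):
    """(1/N!) * (m^2 / (n - m))^N as an exact fraction."""
    return Fraction(m * m, n - m) ** N / factorial(N)


# Illustrative values only.
n, m = 200, 12
for N in range(m + 1):
    exact = binomial_ratio(n, m, N)
    print(N, float(exact), float(upper_bound(n, m, N)))
```

The bound follows because the falling factorial is at most $m^N$ and the rising factorial is at least $(n-m)^N$.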



Substituting (11.28)-(11.29) into (11.27) and recalling that $m = n^{1-\beta}$, we deduce that



E[Wn ■ 



< 



n 



m 



-1 



m ^ 2 



,...,im)&SnN=0 



Nl ^ n 



N 



^11.30) 



where the last term $\le \sum_{N=0}^{m} \frac{1}{N!}(C\log^2 n)^N n^{(1 + d(r) - 2\beta)N}$. By assumption of the lemma,

$$r < \rho^*(\beta) = \begin{cases} \beta - 1/2, & 1/2 < \beta \le 3/4, \\ (1 - \sqrt{1-\beta})^2, & 3/4 < \beta < 1, \end{cases}$$

thus it can be seen that $1 + d(r) - 2\beta < 0$ for all fixed $\beta$ and $r \in (0, \rho^*(\beta))$. Combining this
with (11.30) gives the second claim. □
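The sign condition $1 + d(r) - 2\beta < 0$ can be checked numerically on a grid. The sketch below assumes the definitions $d(r) = \min\{2r,\ 1 - 2(1-\sqrt{r})^2\}$ from Lemma 11.4 and $\rho^*(\beta) = \beta - 1/2$ for $1/2 < \beta \le 3/4$, $\rho^*(\beta) = (1 - \sqrt{1-\beta})^2$ for $3/4 < \beta < 1$. It is a sanity check on a finite grid, not a proof; the grid and function names are hypothetical.

```python
import math


def rho_star(beta):
    """Detection boundary rho*(beta) for 1/2 < beta < 1."""
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2


def d(r):
    """Exponent d(r) = min{2r, 1 - 2(1 - sqrt(r))^2}."""
    return min(2.0 * r, 1.0 - 2.0 * (1.0 - math.sqrt(r)) ** 2)


def exponent(beta, r):
    """The quantity 1 + d(r) - 2*beta that must be negative."""
    return 1.0 + d(r) - 2.0 * beta


# Grid over beta in (1/2, 1) and r strictly inside (0, rho*(beta)).
betas = [0.51 + 0.01 * i for i in range(49)]
fractions_of_boundary = [0.01 * i for i in range(1, 100)]
worst_exponent = max(
    exponent(beta, f * rho_star(beta))
    for beta in betas
    for f in fractions_of_boundary
)
print(worst_exponent)
```

The largest value on the grid is attained close to the boundary $r = \rho^*(\beta)$, where the exponent approaches zero from below.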



11.6 Proof of Lemma 10.2

The key observation is that there is a sequence of positive numbers $\delta_n$ that tends to $0$ as
$n$ diverges to $\infty$ such that $\nu_k \ge (1 - \delta_n)A_n$ for all $k \in \{\ell_1, \ell_2, \ldots, \ell_m\}$, so it is natural to
compare Model (10.7) with the following model:

$$Y^* = \nu^* + Z, \qquad Z \sim N(0, I_n), \qquad (11.31)$$

where $\nu^*$ has $m$ nonzero entries of equal strength $(1 - \delta_n)A_n$ whose locations are randomly
drawn from $\{1, 2, \ldots, n\}$ without replacement.
For short, write $t = t^*$ and

$$H_n(t) = \frac{\sqrt{n}\,(\bar F_n(t) - \bar F_0(t))}{\sqrt{(2b_n - 1)\bar F_0(t)(1 - \bar F_0(t))}}.$$

Let $\bar F_n^*(t)$ be the empirical survival function of $\{(Y_k^*)^2\}_{k=1}^{n}$, and let $\bar F(t) = E[\bar F_n(t)]$ and
$\bar F^*(t) = E[\bar F_n^*(t)]$. Recall that the family of non-central $\chi^2$-distributions has monotone
likelihood ratio. Then $\bar F(t) \ge \bar F^*(t) \ge \bar F_0(t)$. Now, first, since the $Y_k$'s are block-wise
dependent with a block size $\le 2b_n - 1$, it follows by direct calculations that

$$\mathrm{Var}(H_n(t)) \le C\bar F(t)/\bar F_0(t).$$

Second, by $\bar F(t) \ge \bar F^*(t)$,

$$E[H_n(t)] = \frac{\sqrt{n}\,(\bar F(t) - \bar F_0(t))}{\sqrt{(2b_n - 1)\bar F_0(t)(1 - \bar F_0(t))}} \ge \frac{\sqrt{n}\,(\bar F^*(t) - \bar F_0(t))}{\sqrt{(2b_n - 1)\bar F_0(t)(1 - \bar F_0(t))}}, \qquad (11.32)$$

where the right hand side diverges to $\infty$ algebraically fast, by an argument similar to that
in [13]. Combine Chebyshev's inequality, the identity $b_n = \log n$, and the calculations of
the mean and variance of $H_n(t)$,



P{Hnit)<{lognf}<Cilogn) 



Fit) 



n 



2 • 



(11.33) 



It remains to show that the last term in (11.33) is algebraically small. We discuss
separately the cases $\bar F(t)/\bar F_0(t) \ge 2$ and $\bar F(t)/\bar F_0(t) < 2$. For the first case,

$$\frac{\bar F(t)}{n(\bar F(t) - \bar F_0(t))^2} \le \frac{C}{n\bar F(t)} \le \frac{C}{n\bar F_0(t)},$$

which is algebraically small since $t = \sqrt{2q\log n}$ and $0 < q < 1$. For the second case,

$$\frac{\bar F(t)}{n(\bar F(t) - \bar F_0(t))^2} \le \frac{C\bar F_0(t)}{n(\bar F(t) - \bar F_0(t))^2} \le \frac{C\bar F_0(t)}{n(\bar F^*(t) - \bar F_0(t))^2}, \qquad (11.34)$$

which is seen to be algebraically small by comparing it to the right hand side of (11.32).
This concludes the claim. □



12 Complementary technical details 
12.1 Proof of Lemma 7.1

Consider the first claim. Suppose such an autoregression structure exists for a > ao > 0. 
Let 

Yk = ^-{Xk+i-Xk)/d, an = n"V2, k = 1,2, . . . ,n - 1. 

Clearly, $\mathrm{var}(Y_k) = 1$. At the same time, direct calculations show that the correlation
between $Y_1$ and $Y_{j+1}$ equals $((j+1)^\alpha + (j-1)^\alpha - 2j^\alpha)/2$ for all $1 \le j \le n - 2$, which is
no larger than 1. Taking $j = 2$ yields $(3^\alpha + 1 - 2 \cdot 2^\alpha)/2 \le 1$, and hence $\alpha \le 2$.
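The constraint obtained from $j = 2$ is easy to evaluate numerically. This sketch assumes only the correlation formula $((j+1)^\alpha + (j-1)^\alpha - 2j^\alpha)/2$ quoted above; the function name and the sample values of $\alpha$ are hypothetical.

```python
def lag_correlation(j, alpha):
    """Correlation formula ((j+1)^a + (j-1)^a - 2 j^a) / 2 at lag j."""
    return ((j + 1) ** alpha + (j - 1) ** alpha - 2 * j ** alpha) / 2


# At j = 2 the value reaches 1 exactly when alpha = 2 and exceeds 1 beyond it.
for alpha in (0.5, 1.0, 1.5, 2.0, 2.1, 2.5):
    print(alpha, lag_correlation(2, alpha))
```

Values above 1 are not valid correlations, which is the contradiction the argument uses for $\alpha > 2$.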

Consider the second claim. For any $k \ge 1$, define the partial sum $S_k(t) = 1 + 2\sum_{j=1}^{k}(1 -
j^\alpha/n^{\alpha_0})^+\cos(jt)$. By a well-known result on trigonometric series [38], to show the positive-definiteness
of $\Sigma_n$, it suffices to show that

$$S_{k_0+1}(t) \ge 0 \text{ for all } t \in [-\pi, \pi] \text{ and } S_{k_0+1}(t) > 0 \text{ except for a set of measure zero.} \qquad (12.1)$$

Here, $k_0 = k_0(n; \alpha, \alpha_0)$ is the largest integer $k$ such that $k^\alpha \le n^{\alpha_0}$.



We now show (12.1). By [38, Page 183], if we let $a_0 = 2$, and $a_j = 2(1 - j^\alpha/n^{\alpha_0})^+$,
$1 \le j \le n - 1$, then

$$S_{k_0+1}(t) = \sum_{j=0}^{k_0-1}(j+1)\Delta^2a_jK_j(t) + (k_0+1)K_{k_0}(t)\Delta a_{k_0} + D_{k_0+1}(t)a_{k_0+1}.$$

Here, $\Delta a_j = a_j - a_{j+1}$, $\Delta^2a_j = a_j + a_{j+2} - 2a_{j+1}$, and $D_j(t)$ and $K_j(t)$ are the Dirichlet
kernel and the Fejér kernel, respectively,

$$D_j(t) = \frac{\sin((j + \frac{1}{2})t)}{2\sin(\frac{1}{2}t)}, \qquad K_j(t) = \frac{2}{j+1}\Big(\frac{\sin(\frac{1}{2}(j+1)t)}{2\sin(\frac{1}{2}t)}\Big)^2. \qquad (12.2)$$

In view of the definition of $k_0$, $a_{k_0+1} = 2(1 - (k_0+1)^\alpha/n^{\alpha_0})^+ = 0$. Also, by the monotonicity of
$\{a_j\}$, $\Delta a_{k_0} = a_{k_0} - a_{k_0+1} \ge 0$. Therefore, $S_{k_0+1}(t) \ge \sum_{j=0}^{k_0-1}(j+1)\Delta^2a_jK_j(t)$.

We claim that the sequence $\{a_0, a_1, \ldots, a_{n-1}\}$ is convex. In detail, since $\alpha \le 1$, the
sequence $\{j^\alpha\}$ is concave. As a result, the sequence $\{(1 - j^\alpha/n^{\alpha_0})\}$ is convex, and so is the
sequence $\{(1 - j^\alpha/n^{\alpha_0})^+\}$. In view of the definition of $a_j$, the claim follows directly. The
convexity of the $a_j$'s implies that $\Delta^2a_j \ge 0$, $0 \le j \le n - 2$. Therefore, $S_{k_0+1}(t) \ge 0$. This
proves the first part of (12.1).



We now prove the second part of (12.1). We discuss separately the two cases $\alpha < 1$



and a = 1. In the first case, Aao = ^^(2 — 2") > and Ko{t) = 5- As a result, 
Sko+i{i) ^ 2ra"o > 0, and the claim follows. In the second case, Aaj = :^(2j — j — 
{j + 2)) = 0, and Aak,-i = (1 - ^) - 2(1 - ^) = ^{ko + 1) - 1 > 0. Therefore, 
Sko+i{t) > {h + 1)(^ - l)Kko{t). Clearly, Sko+i{t) can only assume when (^)t is 
a multiple of vr. Since the set of such t has measure 0, the claim follows directly. □ 

12.2 Proof of Lemma O 

Let ao = 2, and Ofc = 2A;" - {k + 1)" - {k - 1)", l<k<n-l. Clearly, > for ah A;, so 
/ci(0; a) > 0. Furthermore, when 6* 7^ 0, by ^S] Equation 1.7, Page 183], 

00 

$$f_\alpha(\theta) = \sum_{\nu=0}^{\infty}(\nu + 1)[a_{\nu+2} + a_\nu - 2a_{\nu+1}]K_\nu(\theta), \qquad (12.3)$$

where $K_\nu(\theta)$ is the Fejér kernel as in (12.2). By the positiveness of the Fejér kernel, all
that remains to show is that $a_{k+1} + a_{k-1} - 2a_k \ge 0$, for all $k \ge 2$.

Define $h(x) = (1 + 2x)^\alpha + (1 - 2x)^\alpha - 4(1 + x)^\alpha - 4(1 - x)^\alpha + 6$, $0 \le x \le 1/2$. By direct
calculations, for all $k \ge 2$,

$$a_{k+1} + a_{k-1} - 2a_k = -k^\alpha\Big[\Big(1 + \frac{2}{k}\Big)^\alpha + \Big(1 - \frac{2}{k}\Big)^\alpha - 4\Big(1 + \frac{1}{k}\Big)^\alpha - 4\Big(1 - \frac{1}{k}\Big)^\alpha + 6\Big] = -k^\alpha h\Big(\frac{1}{k}\Big). \qquad (12.4)$$

Also, by basic calculus, $h''(x) = 4\alpha(\alpha - 1)[(1 + 2x)^{\alpha-2} + (1 - 2x)^{\alpha-2} - (1 + x)^{\alpha-2} - (1 - x)^{\alpha-2}]$.
Since $0 < \alpha < 1$, $x^{\alpha-2}$ is a convex function. It follows that $h''(x) < 0$ for all $x \in (0, 1/2)$,
and $h(x)$ is a strictly concave function. At the same time, note that $h(0) = h'(0) = 0$, so
$h(x) < 0$ for $x \in (0, 1/2]$. Combining this with (12.4) gives the claim. □
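Identity (12.4) and the sign of $h$ can be checked numerically. The sketch below assumes the definitions $a_k = 2k^\alpha - (k+1)^\alpha - (k-1)^\alpha$ and $h(x) = (1+2x)^\alpha + (1-2x)^\alpha - 4(1+x)^\alpha - 4(1-x)^\alpha + 6$ used in this proof. It is a floating-point sanity check over a small range of $k$ and a few hypothetical values of $\alpha$, not a substitute for the concavity argument.

```python
def a(k, alpha):
    """Coefficient a_k = 2 k^alpha - (k+1)^alpha - (k-1)^alpha, for k >= 1."""
    return 2 * k ** alpha - (k + 1) ** alpha - (k - 1) ** alpha


def h(x, alpha):
    """h(x) = (1+2x)^a + (1-2x)^a - 4(1+x)^a - 4(1-x)^a + 6 on [0, 1/2]."""
    return ((1 + 2 * x) ** alpha + (1 - 2 * x) ** alpha
            - 4 * (1 + x) ** alpha - 4 * (1 - x) ** alpha + 6)


def second_difference(k, alpha):
    """a_{k+1} + a_{k-1} - 2 a_k, the left-hand side of (12.4)."""
    return a(k + 1, alpha) + a(k - 1, alpha) - 2 * a(k, alpha)


# Compare both sides of (12.4) for a few illustrative values.
for alpha in (0.3, 0.5, 0.9):
    for k in (2, 3, 10, 30):
        lhs = second_difference(k, alpha)
        rhs = -(k ** alpha) * h(1.0 / k, alpha)
        print(alpha, k, lhs, rhs)
```

For large $k$ both sides are of order $k^{\alpha - 4}$, so the check is restricted to moderate $k$ to stay well above rounding error.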
12.3 Proof of Lemma 11.4

Denote the density, cdf and survival function of $N(0, 1)$ by $\phi$, $\Phi$, and $\bar\Phi$. For the first claim,
define $W = X/\sigma_1$ and $V = Y/\sigma_2$ if $\rho \ge 0$ and $V = -Y/\sigma_2$ otherwise. The proofs for the two
cases $\rho \ge 0$ and $\rho < 0$ are similar, so we only show the first one. In this case, it suffices to
show

$$E[\exp(\sigma_1A_nW - \sigma_1^2A_n^2/2) \cdot 1\{V > T_n\}] \le C \cdot n^{-(1 - \sigma_1\rho\sqrt{r})^2}.$$

Write $W = (W - \rho V) + \rho V$, and note that $(1 - \rho^2) + \rho^2 = 1$. It is seen that

$$\sigma_1A_nW - \sigma_1^2A_n^2/2 = [\sigma_1A_n(W - \rho V) - \sigma_1^2(1 - \rho^2)A_n^2/2] + [\sigma_1A_n\rho V - \sigma_1^2\rho^2A_n^2/2]. \qquad (12.5)$$

Since $W$ and $V$ have unit variance and a correlation $\rho$, $(W - \rho V)$ is independent of $V$
and is distributed as $N(0, 1 - \rho^2)$. Therefore, $E[\exp(\sigma_1A_n(W - \rho V) - \sigma_1^2(1 - \rho^2)A_n^2/2)] = 1$.
Combining this with (12.5) gives

$$E[\exp(\sigma_1A_nW - \sigma_1^2A_n^2/2) \cdot 1\{V > T_n\}] = E[\exp(\sigma_1\rho A_nV - \sigma_1^2\rho^2A_n^2/2) \cdot 1\{V > T_n\}].$$

Now, by direct calculations,

$$E[\exp(\sigma_1\rho A_nV - \sigma_1^2\rho^2A_n^2/2) \cdot 1\{V > T_n\}] = \int_{T_n}^{\infty}\phi(x - \sigma_1\rho A_n)\,dx = \bar\Phi(T_n - \sigma_1\rho A_n).$$

Since $\bar\Phi(x) \le C\phi(x)$ if $x > 0$, $\bar\Phi(T_n - \sigma_1\rho A_n) \le C\phi(T_n - \sigma_1\rho A_n) = Cn^{-(1 - \sigma_1\rho\sqrt{r})^2}$. Combining
these results gives the claim.
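The Gaussian tilting identity used in this step, $E[\exp(aV - a^2/2) \cdot 1\{V > T\}] = \bar\Phi(T - a)$ for $V \sim N(0,1)$, can be confirmed by numerical integration. The sketch below is illustrative: `a` stands for $\sigma_1\rho A_n$ and `T` for $T_n$, the integral is approximated by a trapezoid rule truncated at an assumed upper limit, and all function names and values are hypothetical.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_sf(x):
    """Standard normal survival function, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def tilted_tail_expectation(a, T, steps=20000):
    """Trapezoid approximation of E[exp(a V - a^2/2) 1{V > T}] for V ~ N(0, 1).

    The integral over (T, infinity) is truncated at max(T, a) + 12, beyond
    which the integrand is negligible.
    """
    upper = max(T, a) + 12.0
    width = (upper - T) / steps

    def integrand(v):
        return math.exp(a * v - 0.5 * a * a) * normal_pdf(v)

    total = 0.5 * (integrand(T) + integrand(upper))
    for i in range(1, steps):
        total += integrand(T + i * width)
    return total * width


# Illustrative values only.
a, T = 1.2, 2.0
numeric = tilted_tail_expectation(a, T)
closed_form = normal_sf(T - a)
print(numeric, closed_form)
```

With $T = T_n = \sqrt{2\log n}$ and $a = \sigma_1\rho\sqrt{2r\log n}$, the density bound $\phi(T - a)$ equals $n^{-(1 - \sigma_1\rho\sqrt{r})^2}/\sqrt{2\pi}$, which is where the rate in the first claim comes from.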

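We now show the second claim. By Hölder's inequality, it suffices to show that

$$E[\exp(2A_nX - \sigma_1^2A_n^2) \cdot 1\{X \le \sigma_1T_n\}] \le Cn^{d(r)}.$$

Recalling that $W = X/\sigma_1$, we have

$$E[\exp(2A_nX - \sigma_1^2A_n^2) \cdot 1\{X \le \sigma_1T_n\}] = E[\exp(2\sigma_1A_nW - \sigma_1^2A_n^2) \cdot 1\{W \le T_n\}].$$

By direct calculations,

$$E[\exp(2\sigma_1A_nW - \sigma_1^2A_n^2) \cdot 1\{W \le T_n\}] = e^{\sigma_1^2A_n^2}\int_{-\infty}^{T_n}\phi(x - 2\sigma_1A_n)\,dx = e^{\sigma_1^2A_n^2}\,\Phi(T_n - 2\sigma_1A_n).$$

Since $\Phi(x) \le C\phi(x)$ for all $x < 0$ and $\Phi(x) \le 1$ for all $x \ge 0$,
in view of the definition of $d(r)$, $e^{\sigma_1^2A_n^2}\,\Phi(T_n - 2\sigma_1A_n) \le Cn^{d(\sigma_1^2r)}$. Since $\sigma_1 \le 1$ and
$d(r)$ is a monotonically increasing function, we have $d(\sigma_1^2r) \le d(r)$. Combining these
results gives the claim. □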